What are Character Language Models?

Character language models estimate the likelihood of a given text. They are trained from samples of text and then assign higher likelihood to new text that resembles the text they saw in training.

What's in this Tutorial?

This tutorial shows how to tune and evaluate character language models on a corpus. The tutorial may be run on data supplied with the LingPipe demos.

Online Training and Evaluation

This tutorial illustrates how language model training and evaluation may be interleaved with dynamic language models. Language models may be compiled through LingPipe's generic compilable interface; the compiled models are relatively small in memory and very fast to evaluate.
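As a minimal illustration (a sketch assuming the LingPipe 4 API; this is not code from the demo), a dynamic model may be trained and then compiled in memory through com.aliasi.util.AbstractExternalizable:

// Sketch: train a dynamic character LM, then compile it for fast evaluation.
// AbstractExternalizable.compile throws IOException and ClassNotFoundException,
// which must be caught or declared.
NGramProcessLM lm = new NGramProcessLM(5);   // maximum n-gram length 5
lm.train("text used for training");
LanguageModel.Process compiledLm
    = (LanguageModel.Process) AbstractExternalizable.compile(lm);
double log2Prob = compiledLm.log2Estimate("text to score");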

Scaling to Large Corpora

The motivation for LingPipe's implementation is to be able to scale to large training data sizes. We show performance on two widely available multi-gigabyte corpora, Gigaword and MEDLINE (see the section on running larger data sets below).

Applications

Character language models are used in LingPipe for a broad range of applications.

Classification

For classification, character LMs model the type of text being classified. Two other LingPipe tutorials use character language models implicitly for classification.

They may also be used for classification by language or any other distinction.
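As a hedged sketch (using LingPipe's DynamicLMClassifier, which wraps one character LM per category; this is not part of the tutorial's code), classification looks roughly like this:

// One character LM per category; classification picks the category
// whose LM makes the text most likely.
String[] categories = new String[] { "english", "french" };
DynamicLMClassifier<NGramProcessLM> classifier
    = DynamicLMClassifier.createNGramProcess(categories,5);
classifier.handle(new Classified<CharSequence>("the cat sat on the mat",
                                               new Classification("english")));
classifier.handle(new Classified<CharSequence>("le chat est sur le tapis",
                                               new Classification("french")));
Classification result = classifier.classify("the dog sat down");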

Tagging

Two of the taggers in LingPipe use character language models to learn the forms of words used in various ways.

Spelling Correction

Character language models represent the kinds of queries people are likely to make over a corpus by modeling the corpus.

Running the Demo

To run the tutorial out of the box, cd to the lingpipe/demos/tutorial/lm directory and run the command:

> ant demo

which prints the following output as well as writing it to the file lingpipe/demos/tutorial/lm/demo-out-8.txt:

RUN PARAMETERS
CORPUS NAME=medsamp
FILE EXTENSION=medsamp2005.xml
INPUT DIRECTORIES=[../../data/]
TEXT PARSER CLASS=com.aliasi.corpus.parsers.MedlineTextParser
MAX TRAINING CHARS=9223372036854775807
MAX NGRAM=8
NUM CHARS=256
LAMBDA FACTORS=[2.0,  8.0,  32.0]
SAMPLE SIZE=5000
SAMPLE FREQUENCY=1
PRINT FREQUENCY=5000
REPORT WRITTEN TO FILE=demo-out-8.txt

LEARNING CURVE
#CHARS, ELAPSED(s), TOTAL_MEM(MB), FREE_MEM(MB), TOT-FREE(MB),
MEAN(2.0), DEV(2.0), MEAN(8.0), DEV(8.0), MEAN(32.0), DEV(32.0)
# Visiting directory=../../data
 5000, 1, 232, 225,  7,    5.771, 3.474,    5.530, 2.916,    5.736, 2.404
10000, 2, 232, 220, 12,    3.210, 3.483,    3.041, 2.562,    3.507, 2.002
15000, 2, 232, 215, 17,    1.908, 3.066,    1.967, 2.139,    2.614, 1.632
20000, 2, 232, 212, 20,    3.307, 3.779,    2.920, 2.695,    3.148, 2.074
25000, 3, 232, 207, 25,    2.570, 3.500,    2.432, 2.537,    2.832, 2.019
30000, 3, 232, 202, 30,    3.054, 3.595,    2.705, 2.537,    2.933, 1.972
35000, 3, 232, 198, 34,    3.137, 3.565,    2.773, 2.609,    2.958, 2.086
40000, 4, 232, 195, 37,    3.207, 3.698,    2.847, 2.706,    3.035, 2.174
45000, 4, 232, 190, 42,    3.463, 3.915,    3.051, 2.903,    3.180, 2.366
50000, 4, 232, 185, 47,    2.940, 3.685,    2.660, 2.734,    2.911, 2.247
55000, 4, 232, 180, 52,    2.936, 3.524,    2.626, 2.587,    2.853, 2.130
60000, 5, 232, 176, 55,    2.735, 3.480,    2.480, 2.547,    2.724, 2.080
65000, 5, 232, 172, 60,    2.848, 3.605,    2.578, 2.692,    2.802, 2.229

N-GRAM COUNTS
N, #Unique, #Total, %
0,     1, 65752,  1.000
1,    83, 65752,  0.999
2,  1413, 65705,  0.978
3,  5960, 65658,  0.909
4, 14179, 65611,  0.784
5, 23478, 65564,  0.642
6, 31633, 65517,  0.517
7, 38508, 65470,  0.412
8, 43928, 65423,  0.329

TOP N-GRAMS
N, (N-GRAM,Count)*
0,  "",65752
1,  " ",9536  "e",6567  "t",4668  "i",4266  "a",4071
2,  "e ",1625  "s ",1099  " t",1070  "in",1032  "th",1013
3,  " th",696  "the",635  "he ",552  "ed ",465  " of",432
4,  " the",548  "the ",471  " of ",427  "tion",357  " and",296
5,  " the ",468  " and ",287  "ation",210  "tion ",197  " of t",122
6,  " of th",121  "ation ",115  "of the",113  "f the ",104  " with ",101
7,  " of the",113  "of the ",103  "patient",65  " patien",62  "ion of ",62
8,  " of the ",103  " patient",62  " in the ",57  "tion of ",55  "patients",45

The first few lines print out the run parameters, most of which are specified in the build.xml file. These include the name of a corpus; a path and file extension for finding input files; the name of a text parser class for extracting text; a limit on the maximum number of training characters before stopping; the maximum n-gram length; the maximum number of distinct characters in the corpus and test sets; hyperparameters for the model in the form of lambda factors (more on these later); sample size and frequency information; and, finally, a path to a file to which results are appended. After the parameters come the output reports, which we discuss in detail later.

Getting started with AbstractCommand

The code for this example is in the single file src/TrainLM.java. The first thing to note is that the class extends AbstractCommand. The abstract command class gathers command-line arguments through its constructor and provides methods for retrieving them at run time as various types, with error checking built in. The main(String[]) method simply constructs a new instance from the command-line arguments and calls the run() method on the constructed instance:

public static void main(String[] args) {
    new TrainLM(args).run();
}

This is the usual pattern for AbstractCommand extensions and can be seen throughout the command tutorials. The first lines of the constructor call the superclass constructor with the static default parameters properties object:

public TrainLM(String[] args) {
    super(args,DEFAULT_PARAMS);
...
}

As usual with abstract commands, constants are used for flag and property parameter names, and default values are set on a Properties instance to be passed to the AbstractCommand(String[],Properties) constructor, as shown above:

static final String CORPUS_NAME_PARAM = "corpusName";
...
static final String LAMBDA_FACTORS_PARAM = "lambdaFactors";
...
static final String REPORT_FILE_PARAM = "reportFile";
static final Properties DEFAULT_PARAMS = new Properties();
static {
    DEFAULT_PARAMS.setProperty(MAX_NGRAM_PARAM,"5");
    DEFAULT_PARAMS.setProperty(SAMPLE_SIZE,"1000");
    ...
}

This just says that the default maximum n-gram is 5, the default sample size is 1000 characters, and so on.

A quick look at the build.xml file shows how the parameters are invoked with the java Ant task:

<java classname="TrainLM"
         fork="true">
   <classpath refid="classpath.standard"/>

   <arg value="-corpusName=medsamp"/>
   ...
   <arg value="-lambdaFactors=2.0, 8.0, 32.0"/>
   ...
   <arg value="-reportFile=demo-out-8.txt"/>
   <arg value="../../data/"/>
</java>

The ellipses contain further parameters in the build file. This would translate into a one-line command:

java TrainLM \
     -corpusName=medsamp ... "-lambdaFactors=2.0, 8.0, 32.0" ... \
     -reportFile=demo-out-8.txt ../../data/

Lines in the code that access these values are as follows:

File outFile = getArgumentFile(REPORT_FILE_PARAM);
...
mNGram = getArgumentInt(MAX_NGRAM_PARAM);
...
String[] lambdaNames = getCSV(LAMBDA_FACTORS_PARAM);

Each of these assignments uses a different accessor inherited from AbstractCommand; of particular note is the method getCSV(String), which parses an argument as a comma-separated list. All of these methods will throw an IllegalArgumentException if the parameter is not defined, either by default or in the command invocation.

All of the parameters above are specified on the command line as -paramName=value. The abstract command also allows an arbitrary number of unmarked arguments, such as the path ../../data/ in the above invocation. These are accessed as follows:

for (int i = 0; i < numBareArguments(); ++i) {
    File dir = new File(getBareArgument(i));
    ...
}

The only other issue of note for AbstractCommand is that it specifies an abstract method run() that does not take any arguments, return any values, or throw any checked exceptions. This calls for an implementation of the following form:

public void run() {
    try {
      train();
    } catch (Exception e) {
        println("Exception=" + e);
        e.printStackTrace(System.out);
        e.printStackTrace(mPrinter);
    } finally {
        Streams.closeWriter(mPrinter);
    }
}

All exceptions are caught and their stack traces printed both to the report file (through mPrinter) and to System.out (the standard console output unless it has been reset). The finally block closes the print writer that is writing to the file, ensuring the file handle is released even if an exception is thrown during processing.

Getting started with Parser and Handler

The com.aliasi.corpus.Parser abstract class and com.aliasi.corpus.Handler interface follow the same pattern as the org.xml.sax.XMLReader and org.xml.sax.ContentHandler interfaces for parsing XML documents with SAX. Each parser contains a handler to which it supplies callbacks derived from parsing an input source. LingPipe's arrangement is more generic than SAX in that a parser abstracts over a range of different handler interfaces through its generic handler type.

For this demo, we're interested in parsers that provide events to a text handler. With a com.aliasi.corpus.ObjectHandler<E>, there is a single method for performing an arbitrary operation on an object:

public void handle(E e);

In our case, we set E to be CharSequence for handling text, yielding the instantiated method:

public void handle(CharSequence cSeq);

A text parser is a Parser that performs callbacks to its contained handler, in this case an instance of ObjectHandler<CharSequence>. In our running example, the handler implementation performs language model training; the exact implementation is considered in the next section.

On the command line, we specify the name of the class that implements the text parser:

-textParser=com.aliasi.corpus.parsers.MedlineTextParser

This is a parser that knows how to extract the text of abstracts from MEDLINE citations in XML format; the XML will be provided through files (found through command-line specification of directory and file suffix).

The text parser is constructed through the com.aliasi.util.Reflection helper class:

String textParserClassName
    = getExistingArgument(TEXT_PARSER_PARAM);
mTextParser
    = (Parser<ObjectHandler<CharSequence>>) Reflection
               .newInstance(textParserClassName)
...

Training the Estimator

The actual estimator is declared as a member variable:

NGramProcessLM mLM;

and set in the constructor using command-line parameters:

mLM = new NGramProcessLM(mNGram,mNumChars);

This is the basic language model estimator class; it knows how to train a language model from text samples. The parameters in its constructor are the maximum n-gram length and the maximum number of distinct characters that may appear in the training and test collections.
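A minimal usage sketch (not from the demo code):

NGramProcessLM lm = new NGramProcessLM(8,256); // 8-grams, up to 256 distinct chars
lm.train("the patient was treated");
lm.train("the patients were treated");
// joint log (base 2) probability of the text; higher is more likely
double log2Prob = lm.log2Estimate("the patient");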

The actual work of training a language model is performed by an inner class implementing ObjectHandler<CharSequence>:

class TrainingHandler implements ObjectHandler<CharSequence> {
    public void handle(CharSequence cSeq) {
        ...
        mLM.train(cSeq);
    }
}

The ellipsis is filled with code that performs a learning-curve evaluation over the various parameter settings for the model.

The training text handler is assigned to the text parser using the generic setHandler(Handler) method:

mTextHandler = new TrainingHandler();
...
mTextParser.setHandler(mTextHandler);

The actual training files are extracted from the command line into an array and passed to the method trainFile(File):

File[] files = ...
for (int j = 0; j < files.length; ++j)
    trainFile(files[j]);

This method simply branches to create an input source based on whether the file's suffix indicates it's gzipped:

void trainFile(File file) throws IOException {
    String fileName = file.getName();
    if (fileName.endsWith(".gz"))
        trainGZipFile(file);
    else
        trainTextFile(file);
}

Training a text file is as simple as converting the file name to a URL (using the utility in com.aliasi.util.Files) and wrapping it in an input source to be passed to the parser:

void trainTextFile(File file) throws IOException {
    String url = Files.fileToURLName(file);
    InputSource in = new InputSource(url);
    mTextParser.parse(in);
}

Gzipped files are only a bit more complex, constructing their input source through the streaming gzip decoder:

void trainGZipFile(File file) throws IOException {
    System.out.println("# Training gzip file=" + file
                       + " [char count=" + mCharCount + "]");
    FileInputStream fileIn = null;
    BufferedInputStream bufIn = null;
    GZIPInputStream gzipIn = null;
    try {
       fileIn = new FileInputStream(file);
       bufIn = new BufferedInputStream(fileIn);
       gzipIn = new GZIPInputStream(bufIn);
       InputSource in = new InputSource(gzipIn);
       mTextParser.parse(in);
    } finally {
       Streams.closeInputStream(gzipIn);
       Streams.closeInputStream(bufIn);
       Streams.closeInputStream(fileIn);
    }
}

Note that the streams are closed using the exception-swallowing close methods in the com.aliasi.util.Streams class.

Online Learning Curve Estimation

With more training data, our language models become more accurate in the sense of providing better estimates of the likelihood of a character given some previous characters. The example run above shows performance as a function of the amount of training data and the parameter settings. The n-gram model actually provides n-gram estimates for all n up to the maximum specified in the model's constructor. The hyperparameters are supplied as a comma-separated value on the command line; in the demo target, "-lambdaFactors=2.0, 8.0, 32.0".

Technical Note

These parameters determine the relative weight of higher-order n-grams to lower order n-grams. The interpolation also depends on the number of training events seen for an outcome (the more, the higher the weight), and the number of outcomes seen for a given context (the more, the lower the weight for that context). The JavaDoc contains precise mathematical definitions. We need to run some experiments in order to tune the hyperparameter setting.
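In rough symbols (a sketch of the form of the definition; the JavaDoc is authoritative), the weight given to the estimate from a context c, as against the estimate from the next-shorter context c' (c with its earliest character dropped), is

    lambda(c) = count(c) / (count(c) + lambdaFactor * numOutcomes(c))

    P(w | c) = lambda(c) * P_ml(w | c)  +  (1 - lambda(c)) * P(w | c')

where P_ml is the maximum-likelihood estimate and the recursion bottoms out in the uniform distribution over characters. This matches the behavior described above: more training events for a context push lambda up, while more distinct outcomes push it down.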

The hyperparameters are extracted into a double[]-typed member variable mLambdas. The full handler code for the training handler is:

public void handle(CharSequence cSeq) {
    char[] cs = cSeq.toString().toCharArray();
    int start = 0;
    int length = cSeq.length();
    for (int i = 1; i <= length; ++i) {
        ++mCharCount;
        if (mCharCount > mMaxTrainingCharCount)
            exit();
        if ((mCharCount % mSampleFrequency) != 0)
            continue;
        for (int j = 0; j < mLambdas.length; ++j)
            mSamples[j][mSampleIndex]
                = -mLM.log2ConditionalEstimate(cs,start,
                                               i,mNGram,
                                               mLambdas[j]);
        ++mSampleIndex;
        if (mSampleIndex == mSamples[0].length) {
            report();
            mSampleIndex = 0;
        }
    }
    mLM.train(cSeq);
}

This method simply iterates through the characters passed in, sampling estimates before performing the training. For each character, it increments the overall character counter mCharCount. The next line checks whether we've exceeded the maximum training count, calling the top-level exit() method if so. The line after that simply goes on to the next character if the character count is not an even multiple of the sample frequency. Finally, for characters being sampled, an estimate is computed for each hyperparameter value. The method called on the language model is:

mLM.log2ConditionalEstimate(cs,start,i,mNGram,mLambdas[j])

This returns the log (base 2) of the estimate of character cs[i-1] given the characters cs[start]...cs[i-2], with the global n-gram length and the j-th hyperparameter.

These estimates are stored in the two-dimensional double array mSamples, indexed by hyperparameter and sample index. The sample index counter is incremented, and if the sample buffer is full (the sample index is equal to the length of the array), a report is generated and the sample index is reset. Finally, after all the estimation is carried out on the characters being handled, they are used for training.

void report() {
    print(Long.toString(mCharCount));
    print(", " + ((System.currentTimeMillis() - mStartTime)/1000l));
    long totalMem = mRuntime.totalMemory();
    long freeMem = mRuntime.freeMemory();
    print(", " + totalMem/1000000l);
    print(", " + freeMem/1000000l);
    print(", " + (totalMem-freeMem)/1000000l);
    for (int i = 0; i < mLambdas.length; ++i) {
        double xEntropy = Statistics.mean(mSamples[i]);
        double dev = Statistics.standardDeviation(mSamples[i]);
        print(",   " + decimalFormat(xEntropy) + "," + decimalFormat(dev));
    }
    println("");
}

This method reports on a single line. First comes the number of characters, then the elapsed time in seconds. The next three entries are total, free and used memory in megabytes. Then, for each hyperparameter, the mean and standard deviation of the samples are computed using the methods in com.aliasi.stats.Statistics. The mean of the negated log (base 2) sample estimates is known as the sample cross-entropy. The print(String) method called here simply prints to both standard output and the report file.
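In symbols, if the N sampled characters are c_1,...,c_N with contexts h_1,...,h_N, the reported mean is the sample cross-entropy rate:

    H = - (1/N) * SUM_{i=1..N} log2 P(c_i | h_i)

Because each entry of mSamples already stores a negated log (base 2) conditional estimate, taking the mean computes exactly this quantity.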

Final Counts Structure Report

The n-gram language models work by keeping track of all substring counts up to the specified length. These are stored in a trie structure, TrieCharSeqCounter, for easy access. The second report in the output above is for n-gram counts at the end of training. There are columns for the n-gram length, the number of unique n-grams seen of that length, the total number of n-grams seen of that length, and the percentage of n-gram instances that had been seen before. This percentage is an indication of how often the character being estimated had previously been observed in the same context. For instance, the 5-gram row indicates 23,478 unique 5-grams and 65,564 total 5-grams, which means that 64.2% of the 5-gram cases being estimated after 65,000 characters had previously been seen. The reason there aren't as many longer n-grams as shorter n-grams is due to boundary conditions on the text. For instance, the text "abcde" has 4 bigrams, 3 trigrams, 2 4-grams and 1 5-gram. The code to extract this depends on a single method:

TrieCharSeqCounter counter = mLM.substringCounter();
long[][] uniqueTotals = counter.uniqueTotalNGramCount();

which extracts all the counts through the substring counter. This array is indexed by n-gram length, providing pairs of unique and total counts; they're just printed in a loop:

for (int i = 0; i < uniqueTotals.length; ++i) {
    long unique = uniqueTotals[i][0];
    long total = uniqueTotals[i][1];
    double avg = 1.0 - ((double)unique)/(double)total;
    println(i + ", " + unique + ", " + total + ", " + decimalFormat(avg));
}

Note that the single character n-gram unique count is 83, indicating that there are 83 distinct characters in the model.

Top N-grams

The next report is just the top 5 n-grams of each order with their counts. These are presented in CSV form, with quotes around each entry and contained quotes escaped. The code to extract the top n-grams is just a single call to the character sequence counter:

TrieCharSeqCounter seqCounter
    = mLM.substringCounter();
for (int i = 0; i <= mNGram; ++i) {
    ObjectToCounter topNGrams
        = seqCounter.topNGrams(i,5);
    ...

The result is returned as an instance of com.aliasi.util.ObjectToCounter, which provides a mutable mapping of objects to counts. Among its features is the ability to extract the keys (n-grams) ordered by count:

Object[] keysByCount = topNGrams.keysOrderedByCount();

It is then a simple matter to loop over the keys and print them in comma-separated-value format (quoted and with internal quotes escaped):

for (int j = 0; j < keysByCount.length; ++j) {
    String nGram = keysByCount[j].toString();
    int count = topNGrams.getCount(keysByCount[j]);
    String csvNGram
        = '"' + nGram.replace("\"","\\\"") + '"';
    ...

Running Larger Data Sets: Gigaword and MEDLINE

The build.xml build file contains two additional targets, medline and gigaword. For both of these demos, the data should be downloaded and stored in gzipped format and the ant targets modified to point to the download location that contains the gzipped data files.

Time and Space

With large n-gram lengths, these jobs will take hours to run and require gigabytes of memory. For example, running MEDLINE 8-grams on the full set of MEDLINE abstracts without pruning required an 8GB Java heap. They can be run for smaller samples or shorter n-grams with less memory.

Pruning

In order to train on very large amounts of data, the models may be pruned incrementally. This is done by retrieving the sequence counter for the language model and calling its prune(int) method. This will remove all sequence counts below the specified value. This usually reduces the size of the model substantially even with the smallest possible threshold, 2. Pruning efficiency follows from Zipf's Law, which says that the frequency of a word (or in our case, sequence of characters) is inversely proportional to its rank (ordered by frequency).
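A minimal sketch of incremental pruning, using the substringCounter() accessor shown earlier and the prune(int) method just described:

// prune during training to keep memory bounded; prune(2) removes
// every stored sequence whose count is less than 2
TrieCharSeqCounter counter = mLM.substringCounter();
counter.prune(2);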

Generating Pretty Graphs

The report lines are formatted as comma-separated values (CSV) to enable exporting to a spreadsheet or to our favorite free, open source graphing package, Gnuplot. For instance, the graphs in the whitepaper reference were generated this way by combining reports for different n-gram lengths into a single plot.

Scaling Token Language Models

This demo contains code to serialize token language models to files. It also contains code to merge files produced in this way. Because neither stage requires much memory, the two functions may be used together to scale tokenized language models to arbitrary amounts of data.

Serializing Token Language Models

The serialization and merge functionality is provided in src/TokenNGramFiles.java.

To illustrate the three public static methods in TokenNGramFiles, we have provided a demo main method in src/DemoTokenNGramFiles.java.

Running the Demo

The demo may be run from ant using:

$ ant tok-io

The output is provided in four gzipped files:

$ ls *.gz
temp1.lm.gz  temp12.lm.gz  temp2.lm.gz  tempIo.lm.gz

where temp1.lm.gz and temp2.lm.gz contain the counts for sequences in the even- and odd-indexed input lines respectively, temp12.lm.gz contains the counts merged from those two files, and tempIo.lm.gz is the result of reading the merged file back in and then writing it out again.

The input arguments for the demo are:

  <arg value="A TrieCharSeqCounter stores counts of sub strings of strings."/>
  <arg value="zz zz zz xx"/>

  <arg value="A TrieCharSeqCounter stores counts of sub strings of strings."/>
  <arg value="aa aa aa aa aa bb bb bb cc cc dd"/>

To inspect the output, we first unzip the files:

demos/tutorial/lm> gunzip *.gz

Here are the contents of the three count files. First, temp1.lm.gz:

 13
<S> 3
<S> aa 1
<S> aa bb 1
<S> cc 1
<S> cc dd 1
aa 1
aa bb 1
aa bb bb 1
bb 2
bb bb 1
bb bb cc 1
bb cc 1
bb cc dd 1
cc 4
cc <S> 1
cc dd 3
cc dd <S> 1
cc dd cc 2
dd 3
dd <S> 1
dd cc 2
dd cc <S> 1
dd cc dd 1

Next, temp2.lm.gz:

 12
<S> 3
<S> dd 2
<S> dd ee 2
dd 4
dd <S> 1
dd dd 1
dd dd <S> 1
dd ee 2
dd ee dd 1
dd ee ff 1
ee 4
ee <S> 1
ee dd 1
ee dd dd 1
ee ee 1
ee ee <S> 1
ee ff 1
ee ff ee 1
ff 1
ff ee 1
ff ee ee 1

Finally, the merged file temp12.lm.gz:

 25
<S> 6
<S> dd 2
<S> dd ee 2
bb 2
cc 4
cc dd 3
cc dd cc 2
dd 7
dd <S> 2
dd cc 2
dd ee 2
ee 4

The counts trained from the even-indexed arguments (0 and 2) are in temp1.lm.gz, those from the odd-indexed arguments (1 and 3) are in temp2.lm.gz, and the merged counts, pruned to a minimum count of 2, are in temp12.lm.gz.

Code Walkthrough

The demo code for carrying all of this out is contained in a short main method in DemoTokenNGramFiles:

public static void main(String[] args) throws IOException {
    ...
}

The first step simply defines the tokenizer factory to use and constructs a tokenized language model:

  TokenizerFactory tokenizerFactory
    = IndoEuropeanTokenizerFactory.INSTANCE;
  TokenizedLM lm1 = new TokenizedLM(tokenizerFactory,3);

This will store token sequences up to length 3, using the Indo-European tokenizer factory. Of course, other tokenizer factories, such as ones based on customized regular expressions, will produce different results; this tokenizer factory is used here just for simplicity.
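For example (a hedged sketch; RegExTokenizerFactory is LingPipe's regular-expression-based tokenizer factory), a factory that treats maximal runs of non-whitespace as tokens could be substituted:

// tokens are maximal runs of non-whitespace characters
TokenizerFactory regexFactory = new RegExTokenizerFactory("\\S+");
TokenizedLM lmRegex = new TokenizedLM(regexFactory,3);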

Then the first language model is trained on the even-indexed input lines (arguments 0 and 2):

  for (int i = 0; i < args.length; i += 2)
      lm1.handle(args[i]);

Next, the model is written to the file temp1.lm.gz using a static method from TokenNGramFiles:

  File file1 = new File("temp1.lm.gz");
  TokenNGramFiles.writeNGrams(lm1,file1,0,3,1,"UTF-8");

The (unzipped) contents written to temp1.lm.gz are shown in the first listing above. The parameters 0,3 indicate the range of n-gram lengths to store, here 0-grams to 3-grams. The 1 argument indicates the minimum count for an n-gram to be stored. The string "UTF-8" specifies UTF-8 encoded Unicode as the character encoding.

In the second step, the second LM is trained on the odd-indexed lines (arguments 1 and 3) and written to file in the same way:

  TokenizedLM lm2 = new TokenizedLM(tokenizerFactory,3);
  for (int i = 1; i < args.length; i += 2)
      lm2.train(args[i]);
  File file2 = new File("temp2.lm.gz");
  TokenNGramFiles.writeNGrams(lm2,file2,0,3,1,"UTF-8");

After writing out the n-grams computed from alternating lines of the input, they are merged using the method TokenNGramFiles.merge():

  File file12 = new File("temp12.lm.gz");
  TokenNGramFiles.merge(Arrays.asList(file1,file2),
                        file12,
                        "UTF-8",
                        2, false);

The merged n-grams are written (in gzipped form) to temp12.lm.gz. The arguments to the merge() method consist of a list of files to merge (which may be of any size), followed by a target file and encoding, followed by a minimum count for pruning and a boolean flag indicating whether to delete the input files after the merge completes. Here, we left the input files so they could be inspected.

After unzipping, the contents of the merged file are shown in the third listing above. Note that any string with combined count lower than 2 is removed. The string dd <S> has count 1 in both files, giving it a combined count of 2, so it is preserved in the output.

The fourth stage of the demo reads the combined file into a new language model using TokenNGramFiles.addNGrams():

  TokenizedLM lmIo
      = new TokenizedLM(tokenizerFactory, 3,
                        lm1.unknownTokenLM(), lm1.whitespaceLM(), lm1.lambdaFactor(),
                        false);
  TokenNGramFiles.addNGrams(file12,"UTF-8",lmIo,0);
  File fileIo = new File("tempIo.lm.gz");
  TokenNGramFiles.writeNGrams(lmIo,fileIo,1,3,2,"UTF-8");

First, a tokenized language model is created using the six-argument constructor. This is done so that the final argument can be specified as false, which prevents boundaries from being auto-incremented during construction and allows exactly the counts of n-grams in the file to be read in. The same tokenizer factory is used as before, and the maximum n-gram length is still 3. For convenience, we reuse the unknown-token, whitespace and interpolation parameters from the first model (these are set to default constants when the two-argument constructor is used).

Next, the addNGrams() method is called, which reads the counts from the specified file using the specified encoding and adds them to the specified language model after pruning at the specified level.

For inspection, we finally write the n-grams from this new model, as before, to tempIo.lm.gz. The result is identical to temp12.lm.gz (as verified by the diff command), whose contents are shown in the third listing above.

Warning on Tokenizers

Tokens must not contain Unicode characters with code points at or below U+0020 (e.g., no spaces, line feeds, or carriage returns in tokens).

MEDLINE Token N-Grams

In this section, we illustrate the class that performs map-reduce-like processing to scale n-gram indexing, TokenNGramIndexer, the code for which is in src/TokenNGramIndexer.java.

For an interface, the class implements corpus.TextHandler, which provides a single void handle() method accepting a character slice. The text is tokenized and buffered in an in-memory language model. When an in-memory character threshold is exceeded, the buffered model is written out to a level-zero file. When the number of files at a given level exceeds a threshold, they are merged into a file at the next level (which may trigger further merges, all carried out at the same time). At the end, the remaining files are merged into a single output file. In this way, token n-gram counts may be scaled to arbitrary inputs, with the only limit being processing time (which may be further distributed over multiple machines, leading to a final meta-merge). The cascade is sketched after this paragraph.
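The file cascade can be pictured with a toy simulation (all names here are hypothetical; this is not the TokenNGramIndexer code, only a model of the counting logic described above):

import java.util.ArrayList;
import java.util.List;

// Toy simulation of the cascading merge policy described above.
public class LevelMergeSketch {
    static final int MAX_FILES_PER_LEVEL = 2;
    // mFilesPerLevel.get(k) == number of count files currently at level k
    final List<Integer> mFilesPerLevel = new ArrayList<Integer>();

    // called each time the in-memory buffer is dumped to a level-0 file
    void addLevelZeroFile() {
        incrementLevel(0);
        // whenever a level holds too many files, merge them into a
        // single file at the next level; the cascade may continue upward
        for (int k = 0; k < mFilesPerLevel.size(); ++k) {
            if (mFilesPerLevel.get(k) > MAX_FILES_PER_LEVEL) {
                mFilesPerLevel.set(k,0);
                incrementLevel(k+1);
            }
        }
    }

    void incrementLevel(int level) {
        while (mFilesPerLevel.size() <= level)
            mFilesPerLevel.add(0);
        mFilesPerLevel.set(level, mFilesPerLevel.get(level) + 1);
    }

    public static void main(String[] args) {
        LevelMergeSketch sketch = new LevelMergeSketch();
        for (int i = 0; i < 10; ++i)
            sketch.addLevelZeroFile();
        System.out.println(sketch.mFilesPerLevel); // prints [1, 0, 1]
    }
}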

MEDLINE Demo Data

Although this code was built to count n-grams in all of MEDLINE, this demo restricts attention to the reasonably sized public sample files distributed by the U.S. National Library of Medicine. The distribution contains the current year's sample files. For 2009, there are seven sample files, named medsamp2009a.xml.gz through medsamp2009g.xml.gz, ranging in size from 4 to 20 megabytes gzipped; most of their content is XML that is irrelevant for this demo.

For parsing MEDLINE, we use the LingPipe package com.aliasi.medline, which contains a full object-oriented MEDLINE parser, medline.MedlineParser, which sends citations to a medline.MedlineHandler. We will simply extract the text from a citation and send it to the indexer.

Running the Demo

The demo may be run from ant using the following command (warning: it takes about half an hour on a decent, but not state-of-the-art, notebook circa 2009):

$ ant med-n

MEDLINE Sample Data Directory=c:\data\medline\samples
Index Directory=temp-medline-idx
 Data File=c:\data\medline\samples\medsamp2009a.xml.gz
 Data File=c:\data\medline\samples\medsamp2009b.xml.gz
 ...

After the task is run, the index will be in the specified index directory in gzipped format, just like the output of the previous demos. For the 2009 MEDLINE release, that's a 25,443,818 byte gzipped file of counts at level 5_1. It may be read back into models, etc.

Code Walkthrough

The MEDLINE-specific portion of the demo is in src/MedlineNGrams.java. The main method starts by setting up the directories and parameters for the token n-gram indexer:

public static void main(String[] args)
    throws IOException, SAXException {

    File medlineDir = new File(args[0]);
    File indexDir = new File(args[1]);

The data directory containing the gzipped MEDLINE files and the index directory to which files are written are given as fixed command-line arguments. Next, the other parameters are assigned:

    int maxCharsBuffered = 4 * 1024 * 1024;
    int maxFilesPerLevel = 2;
    TokenizerFactory tokenizerFactory
        = IndoEuropeanTokenizerFactory.INSTANCE;
    int nGramOrder = 5;
    String encoding = "UTF-8";

Then the indexer itself is created, based on all of these parameters:

    final TokenNGramIndexer indexer
            = new TokenNGramIndexer(maxCharsBuffered,
                                    maxFilesPerLevel,
                                    tokenizerFactory,
                                    nGramOrder,
                                    indexDir,encoding);

We also create a sentence chunker before parsing; like the indexer, it is declared final so that it may be used in the anonymous inner class defined below:

    final Chunker sentenceChunker
        = new SentenceChunker(tokenizerFactory,
                              new MedlineSentenceModel());

Note the use of a special model for MEDLINE sentence chunking given in com.aliasi.sentences.MedlineSentenceModel.

The work is done in a MEDLINE citation handler, with a handle() method and a do-nothing delete() method that will not be called here:

    MedlineHandler medlineHandler = new MedlineHandler() {
        public void handle(MedlineCitation citation) {
            ...
        }
        public void delete(String id) { }
    };

The handle method just grabs the text of each sentence from the title and body and sends it to the indexer:

        Article article = citation.article();
        String title = article.articleTitleText();
        char[] cs = title.toCharArray();
        indexer.handle(cs,0,cs.length);

        Abstract abstrct = article.abstrct();
        if (abstrct == null) return;
        String text = abstrct.textWithoutTruncationMarker();
        Chunking sentenceChunking
            = sentenceChunker.chunk(text);
        Set<Chunk> sentenceChunks
            = sentenceChunking.chunkSet();
        for (Chunk chunk : sentenceChunks) {
            int start = chunk.start();
            int end = chunk.end();
            String sentence = text.substring(start,end);
            cs = sentence.toCharArray();
            indexer.handle(cs,0,cs.length);
         }
         ...

The rest of the code simply parses the files out of their gzipped distribution format and sends them to LingPipe's MEDLINE parser, with the handler defined above attached. It looks as follows, with the gory details of selecting the file, creating the input source, and cleaning up in an orderly fashion elided:

        MedlineParser parser = new MedlineParser(includeRawXml);
        parser.setHandler(medlineHandler);
        for (File file : ...) {
            ...
            InputSource in = new InputSource(...);
            in.setEncoding("UTF-8");
            parser.parse(in);
         }
         indexer.close();
         int minFinalCount = 2;
         indexer.optimize(minFinalCount);

Token N-Gram Indexer

The rest of the work is done in the token n-gram indexer, the code for which is in src/TokenNGramIndexer.java. It contains the methods required to buffer n-grams in a tokenized LM and write them to the file system when necessary. It also carries out merges when the number of files at a given level would exceed the maximum allowed. All of these operations simply call the static methods in TokenNGramFiles.
