Why are Chinese Words Hard?

Unlike Western languages, Chinese is written without spaces between words. Thus to run any word- or token-based linguistic processing on Chinese, it is first necessary to determine word boundaries. This tutorial shows how to segment Chinese into words based on LingPipe's spelling corrector.

How's it Done?

The basic idea is to treat the lack of space between tokens as spelling "mistakes" which the spelling corrector will "correct" with the insertion of spaces.
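
For instance, writing X for an arbitrary Chinese character, an unsegmented input and its hypothesized "correction" might look like this (the same schematic notation is used in the evaluation examples later in the tutorial):

unsegmented input:  XXXXXXXXX
segmented output:   XXX X XXXX X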

Who Thought of Doing it This Way?

This is just another way of looking at the compression-based approach of Bill Teahan et al. See the references for more details.

1. Downloading Training Corpora

Luckily for us, there are three publicly available training corpora for Chinese segmentation, made available as part of the First International Chinese Word Segmentation Bakeoff. The Second Bakeoff, held in 2005, is covered in a later section; all further discussion in this section is of the first bakeoff. These bakeoffs are sponsored by SigHan, the Chinese special interest group (SIG) of the Association for Computational Linguistics (ACL).

Step one for the tutorial is to download the test and training data from the six links below (after noting that the data are made available for research purposes only as stated on this page):

First International Chinese Word Segmentation Bakeoff Data Links & Content
Corpus Creator          | Training                              | Testing                             | Encoding   | # Train Words | # Test Words
Academia Sinica (AS)    | Training Data (11.8M) (mirror)        | Testing Data (60K) (mirror)         | CP950      | 5.8M          | 12K
HK CityU (HK)           | Training Data (500K) (mirror)         | Testing Data (150K) (mirror)        | Big5_HKSCS | 240K          | 35K
Peking University (PK)  | Training Data (2.3M) [no longer live] | Testing Data (90K) [no longer live] | CP936      | 1.1M          | 17K

Place all six of these files (without unzipping the .zip files) into a directory. We'll call the directory containing the data dataDir after the Ant property we will use to specify it.

2. Running the Evaluations

Once the code is compiled, there are three Ant tasks which can be used to run the evaluations. Running them produces standard output as well as an output file in the official bakeoff format.

To run the evaluation on the Hong Kong City University corpus, first cd to the demo directory:

cd lingpipe/demos/tutorial/chineseTokens

Then you can either run the evaluation from Ant by specifying the location of the data directory on the command line

ant -DdataDir=dataDir run-cityu

or directly via the following command (with the name you chose for your data directory substituted for dataDir; the classpath separator shown is the Windows-style semicolon ";" -- replace the semicolons with colons ":" if you are on Linux, Unix, or Mac OS X):

java -cp "../../../lingpipe-3.9.3.jar;zhToksDemo.jar" ChineseTokens dataDir cityu hk cityu.out Big5_HKSCS 5 5.0 5000 256 0.0 0.0

For example, we downloaded the six files to e:\data\chineseWordSegBakeoff03, so we can run as follows (please be patient -- compiling the spell checker takes eight minutes or so on our desktop):

> java -cp "../../../lingpipe-3.9.3.jar;zhToksDemo.jar" ChineseTokens e:\data\chineseWordSegBakeoff03 cityu hk cityu.out Big5_HKSCS 5 5.0 5000 256 0.0 0.0
CHINESE TOKENS DEMO
    Data Directory=e:\data\chineseWordSegBakeoff03
    Train Corpus Name=cityu
    Test Corpus Name=hk
    Output File Name=e:\data\chineseWordSegBakeoff03\cityu.out.segments
    Known Tokens File Name=e:\data\chineseWordSegBakeoff03\cityu.out.knownWords
    Char Encoding=Big5_HKSCS
    Max N-gram=5
    Lambda factor=5.0
    Num chars=5000
    Max n-best=256
    Continue weight=0.0
    Break weight=0.0
Training Zip File=e:\data\chineseWordSegBakeoff03\cityu_training.zip
Compiling Spell Checker
Testing Results. File=e:\data\chineseWordSegBakeoff03\hk-testref.txt
  # Training Toks=23747  # Unknown Test Toks=1855
  # Training Chars=3649  # Unknown Test Chars=89
Token Length, #REF, #RESP, Diff
    1, 16867, 17267, 400
    2, 15058, 14740, -318
    3, 2126, 2072, -54
    4, 703, 721, 18
    5, 82, 112, 30
    6, 71, 85, 14
    7, 19, 29, 10
    8, 12, 15, 3
    9, 5, 5, 0
Scores
  EndPoint: P=0.9748424085113665 R=0.9777148415150475 F=0.9762765121759623
     Chunk: P=0.935963260881967 R=0.9387212129881276 F=0.9373402082470399

Reading the Output

Hopefully this output is fairly easy to interpret. The first few lines just parrot back the input parameters. We will describe these as we go through the code in the demo. Then there's a note to say that the training is being done using the specified zip file. Training the language model is fairly quick. There's a bit of a wait after the message that says the spell checker is being compiled. That's because highly branching character language models like those for Chinese are slow to compile in LingPipe (this may be optimized in a later version -- the slowness derives from repeatedly summing the counts of the daughters of a node). Then there's a note to say testing is going on, echoing the name of the test file.

Descriptive Token Statistics

The next two lines provide a report on the number of training tokens and characters, along with the number of unknown test tokens and test characters. A token is said to be "unknown" if it appears in the test data without appearing in the training data. There were 89 unknown characters and 1855 unknown tokens in the Hong Kong City University test data.
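
In code, "unknown" amounts to a set difference over the token sets accumulated while extracting lines; here is a minimal sketch, where the test-set accumulator names are hypothetical:

// hypothetical member names; the demo accumulates these sets while extracting lines
Set<String> unknownTestToks = new HashSet<String>(mTestTokenSet);
unknownTestToks.removeAll(mTrainingTokenSet);   // tokens in the test data never seen in training
int numUnknownTestToks = unknownTestToks.size();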

A file is also populated with the known tokens, one per line. These are put in the file indicated in the output, which goes in the data directory with a suffix .knownWords.

The next few lines provide histograms of token length in the reference (the gold-standard test data) and response (system output), as well as the difference. For instance, the reference contained 15,058 tokens of length two, whereas the output produced only 14,740 such tokens, a difference of -318. Our system is producing too many tokens of length 1, too few of lengths 2 and 3, and then too many again of lengths longer than 3.

Precision and Recall Results

In addition to all of these descriptive statistics, two sets of precision, recall and f-measure scores are reported for the run. The first measures precision and recall of endpoints; the second measures precision and recall of the words (chunks) themselves. This is the same pair of evaluations as we used in the sentence demo. Our chunk scores are computed the same way as by the official scoring script from the bakeoff, for which the top-scoring system on this corpus had scores of P=0.934, R=0.947, F=0.940 (vs. our P=0.936, R=0.939, F=0.937). Interestingly, computing binomial confidence intervals for these results yields a 95% confidence interval of +/-0.003. Thus we conclude that our approach is a reasonable one (though we also knew about Bill Teahan's paper cited in the references below).

Official Scoring Script

The run also produces an output file, cityu.out.segments, derived from the name cityu.out specified on the command line. This file acts as the official output; it's what would have been sent to the organizers had we been in time to actually enter the bakeoff.

We've included the original scoring script with this distribution. It can be run on the output relative to the reference segmentation and a dictionary of known words. Running it requires the Perl scripting language along with the command diff (these are typically installed with Linux distributions; we'd recommend the Cygwin distribution of Unix tools for MS Windows users).

With Perl installed, it's easy to run the official script. It's just:

perl bin\score.pl knownWords referenceSegments responseSegments

which for our output named cityu.out and data directory e:\data\chineseWordSegBakeoff03 yields the command:

perl bin\score.pl e:\data\chineseWordSegBakeoff03\cityu.out.knownWords e:\data\chineseWordSegBakeoff03\hk-testref.txt e:\data\chineseWordSegBakeoff03\cityu.out.segments

This prints an analysis per test sentence with the actual diff of the response and reference segments. We don't speak Chinese, so it's all Greek to us. Looking at the tail of the file shows us this:

SUMMARY:
TOTAL INSERTIONS:       738
TOTAL DELETIONS:        635
TOTAL SUBSTITUTIONS:    1507
TOTAL NCHANGE:  2880
TOTAL TRUE WORD COUNT:  34955
TOTAL TEST WORD COUNT:  35058
TOTAL TRUE WORDS RECALL:        0.939
TOTAL TEST WORDS PRECISION:     0.936
F MEASURE:      0.937
OOV Rate:       0.071
OOV Recall Rate:        0.542
IV Recall Rate: 0.969

In particular, note that the recall and precision figures reported here match our own chunk-level precision, recall, and f-measure, namely P=0.936, R=0.939 and F=0.937. The script further goes on to calculate performance on out-of-vocabulary words; the out-of-vocabulary rate is 7.1 percent, and recall on out-of-vocabulary tokens is only 54.2%. The top-scoring system for the bakeoff had an out-of-vocabulary recall of 62.5%.

Running Other Corpora

The other corpora can be run in exactly the same way; all that needs to change are the corpus names, the character encoding, and the output file names in the command. Note that the other corpora are larger and take more time to process. Here are the results of running all three corpora with zero edit costs, an n-best list large enough to avoid search errors, and length-5 n-grams (the best-performing n-gram size in the evaluation run by Bill Teahan; see the references). In other words, these are completely "out of the box" settings. We'll discuss tuning later.

Chunk-Level Scoring
Corpus           |  Default LingPipe Results  |  Winning Closed Bakeoff Result
                 |  Prec   Rec    F           |  Prec   Rec    F       Winning Site
HK City Uni      |  0.936  0.939  0.937       |  0.934  0.947  0.940*  Ac Sinica
Peking Uni       |  0.930  0.926  0.928       |  0.940  0.962  0.951*  Inst. of Comp. Tech, CAS
Academia Sinica  |  0.960  0.969  0.964*      |  0.966  0.956  0.961   UC Berkeley

The results marked with an asterisk are the best F-scores for the respective corpus. The results have a 95 percent confidence interval of roughly +/-0.003 (differing slightly by performance and amount of training data, as described in the Sproat and Emerson paper cited below). For two of the three corpora, Hong Kong City University's and Academia Sinica's, LingPipe's F-score was not significantly different from that of the bakeoff winner.

The official bakeoff also had an "open" category that allowed external resources to be used for training. No open system submitted for the Academia Sinica corpus performed better than the closed submissions. The best open-system f-measure for the HK corpus was 0.956, and the best for the PK corpus was 0.959, both significantly better than the closed entries.

Academia Sinica's is the largest corpus, at 5.8M words of training data, and the results on that corpus are similar to what is reported in Bill Teahan's paper (cited in the references) for the proprietary RocLing corpus. Our conclusion is that LingPipe's out-of-the-box performance is state of the art for purely learning-based systems.

3. Inspecting The Code

The code for the demo is contained in a single file: src/ChineseTokens.java.

Main and Run

The main program simply creates a new instance from the arguments and calls its run method:

public static void main(String[] args) {
    try {
        new ChineseTokens(args).run();
    } catch (Throwable t) {
        System.out.println("EXCEPTION IN RUN:");
        t.printStackTrace(System.out);
    }
}

Throwables are caught and their stack traces dumped for debugging.

Rather than using a more complex command-line framework, such as LingPipe's util.AbstractCommand, we simply pass all the arguments to the constructor, which parses them and sets member variables of the appropriate types:

public ChineseTokens(String[] args) {
    mDataDir = new File(args[0]);
    mTrainingCorpusName = args[1];
    mTestCorpusName = args[2];
    mOutputFile = new File(mDataDir,args[3]+".segments");
    mKnownToksFile = new File(mDataDir,args[3]+".knownWords");
    mCharEncoding = args[4];
    mMaxNGram = Integer.parseInt(args[5]);
    mLambdaFactor = Double.parseDouble(args[6]);
    mNumChars = Integer.parseInt(args[7]);
    mMaxNBest = Integer.parseInt(args[8]);
    mContinueWeight = Double.parseDouble(args[9]);
    mBreakWeight = Double.parseDouble(args[10]);
}

The run method just calls the three worker methods in order:

void run() throws ClassNotFoundException, IOException {
    compileSpellChecker();
    testSpellChecker();
    printResults();
}

Training and Compiling

The first worker method encapsulates the training and compilation of a spell checker.

Constructing a Trainer

In order to train and compile the spelling checker, we first construct a training instance out of an n-gram process language model and a weighted edit distance:

void compileSpellChecker() throws IOException, ClassNotFoundException {
    NGramProcessLM lm
        = new NGramProcessLM(mMaxNGram,mNumChars,mLambdaFactor);
    WeightedEditDistance distance
        = new ChineseTokenizing(mContinueWeight,mBreakWeight);
    TrainSpellChecker trainer
         = new TrainSpellChecker(lm, distance,null);
    ...

The n-gram process language model represents the source model for the noisy-channel spelling decoder. It is parameterized by the n-gram size, the number of characters in the underlying training and test set, and an interpolation factor. These are all described in the Language Modeling Tutorial. Each of them may be used to tune performance as indicated below.

The spell checking trainer is constructed from the language model and a weighted edit distance. In this case, the edit distance is an instance of the inner class ChineseTokens.ChineseTokenizing. This is just a generalization of the LingPipe constant CompiledSpellChecker.TOKENIZING that allows for non-zero match and space-insertion weights. Until we consider tuning in the last section, we will use an instance of ChineseTokenizing that is identical to CompiledSpellChecker.TOKENIZING: the cost of matching is zero, the cost of inserting a single space character is zero, and all other edit costs are negative infinity. In the generalized edit distance, the weights for matching (continuing a token) and inserting a space (ending a token) may be non-zero negative numbers.
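
The actual implementation is in src/ChineseTokens.java; the following is a minimal sketch of what such a weighted edit distance might look like, assuming the standard WeightedEditDistance extension points (matchWeight, insertWeight, and so on):

static class ChineseTokenizing extends WeightedEditDistance {
    private final double mContinueWeight;  // weight for matching (continuing a token)
    private final double mBreakWeight;     // weight for inserting a space (ending a token)

    ChineseTokenizing(double continueWeight, double breakWeight) {
        mContinueWeight = continueWeight;
        mBreakWeight = breakWeight;
    }
    // continuing a token just matches the next character
    public double matchWeight(char cMatched) {
        return mContinueWeight;
    }
    // the only legal insertion is a single space, which ends a token
    public double insertWeight(char cInserted) {
        return cInserted == ' ' ? mBreakWeight : Double.NEGATIVE_INFINITY;
    }
    // all other edits are impossible
    public double deleteWeight(char cDeleted) {
        return Double.NEGATIVE_INFINITY;
    }
    public double substituteWeight(char cDeleted, char cInserted) {
        return Double.NEGATIVE_INFINITY;
    }
    public double transposeWeight(char cFirst, char cSecond) {
        return Double.NEGATIVE_INFINITY;
    }
}

With both weights set to 0.0, this behaves exactly like CompiledSpellChecker.TOKENIZING.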

The final argument to the TrainSpellChecker constructor is null, meaning that the edits are not going to be restricted to producing tokens in the training data.

Providing Training Instances

The training process itself is just a matter of looping through the lines of the entries in the zip file:

FileInputStream fileIn = new FileInputStream(trainingFile);
ZipInputStream zipIn = new ZipInputStream(fileIn);
ZipEntry entry = null;
while ((entry = zipIn.getNextEntry()) != null) {
    String[] lines = extractLines(zipIn,mTrainingCharSet,mTrainingTokenSet);
    for (int i = 0; i < lines.length; ++i)
        trainer.handle(lines[i]);
}
Streams.closeInputStream(zipIn);

The extractLines(InputStream,Set,Set) method takes the input stream from which to read the lines along with two sets. The sets are used to accumulate the characters and tokens found in the training data (and later in the test data). The extractor is also responsible for normalizing the whitespace to single space characters between tokens and a single line-final space character:

while ((refLine = bufReader.readLine()) != null) {
    String trimmedLine = refLine.trim() + " ";
    String normalizedLine = trimmedLine.replaceAll("\\s+"," ");

The point is to get the normalized lines to the trainer while accumulating some statistics.
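
Putting the pieces together, extractLines might look roughly like the following sketch (the real method is in src/ChineseTokens.java; the use of the corpus character encoding and the details of the return value are assumptions here):

String[] extractLines(InputStream in, Set<Character> charSet, Set<String> tokenSet)
    throws IOException {

    List<String> lineList = new ArrayList<String>();
    BufferedReader bufReader
        = new BufferedReader(new InputStreamReader(in,mCharEncoding));
    String refLine;
    while ((refLine = bufReader.readLine()) != null) {
        String trimmedLine = refLine.trim() + " ";
        String normalizedLine = trimmedLine.replaceAll("\\s+"," ");
        lineList.add(normalizedLine);
        for (String token : normalizedLine.split(" ")) {
            if (token.length() == 0) continue;
            tokenSet.add(token);                                  // accumulate token statistics
            for (int i = 0; i < token.length(); ++i)
                charSet.add(Character.valueOf(token.charAt(i)));  // accumulate character statistics
        }
    }
    return lineList.toArray(new String[0]);
}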

Compiling and Configuring the Spell Checker

After the trainer has been trained on all the lines, the spell checker is compiled in-memory in one line using the compile(Compilable) method in util.AbstractExternalizable:

mSpellChecker
    = (CompiledSpellChecker) AbstractExternalizable.compile(trainer);

The spell checker is tuned by the following series of set method calls:

mSpellChecker.setAllowInsert(true);
mSpellChecker.setAllowMatch(true);
mSpellChecker.setAllowDelete(false);
mSpellChecker.setAllowSubstitute(false);
mSpellChecker.setAllowTranspose(false);

mSpellChecker.setNumConsecutiveInsertionsAllowed(1);

mSpellChecker.setNBest(mMaxNBest);

This tells the spell checker that only insert and match edits are allowed, thus saving it the time of inspecting other edits. The second-to-last method call limits the number of consecutive insertions to one; this is because we only ever insert a single space character. The last method call establishes the maximum number of hypotheses carried over after processing each character. Higher values cause fewer search errors, whereas lower values are faster. This value is typically tuned empirically to be as low as possible without causing search errors.

Compiling to and Reading from a File

If memory is at a premium or if the model is going to be reused, it may be written to a file rather than compiled in memory. To write a model to a file, it must be wrapped in an object output stream:

File compiledModelFile = ...;

OutputStream out = new FileOutputStream(compiledModelFile);
ObjectOutput objOut = new ObjectOutputStream(out);
trainer.compileTo(objOut);

The model may then be read back in by reversing the process:

InputStream in = new FileInputStream(compiledModelFile);
ObjectInput objIn = new ObjectInputStream(in);
mSpellChecker = (CompiledSpellChecker) objIn.readObject();

After it is read back in, it can have its runtime parameters set as illustrated above.

Tokenizing

A single execution of main in ChineseTokens runs a performance evaluation after training the model. The original SigHan bakeoff data is divided into a zip file of training data files and a single test file in the same format. The lines are extracted from the test file in the same way as from the training files and then handed off one by one to the method test(String). The test method starts as follows:

void test(String reference) throws IOException {
    String testInput = reference.replaceAll(" ","");

    String response = mSpellChecker.didYouMean(testInput);
    response += ' ';
    ...

This simply removes all the spaces from the test input using the Java String method replaceAll. The result is then supplied to the spell checker, and the first-best "correction" is assigned to the response variable. A final space is appended to match the input format and make evaluation simpler.

Evaluation

The following code is a repetition of the first three lines of the test(String) method:

    String testInput = reference.replaceAll(" ","");
    String response = mSpellChecker.didYouMean(testInput);
    response += ' ';

Bakeoff Output

The next two lines simply write output in the "official" output format.

    mOutputWriter.write(response);
    mOutputWriter.write("\n");

This is the format that will serve as input to the official scoring script. Note that the output writer was allocated to use the same character encoding as the corpus, a requirement of the bakeoff format.
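
The writer itself was presumably allocated along these lines (member names follow the demo; the exact code is in src/ChineseTokens.java):

// assumption: open the output writer with the corpus character encoding
Writer mOutputWriter
    = new OutputStreamWriter(new FileOutputStream(mOutputFile), mCharEncoding);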

Break Point Evaluation

The first evaluation in the demo is of break points.

    Set<Integer> refSpaces = getSpaces(reference);
    Set<Integer> responseSpaces = getSpaces(response);
    prEval("Break Points",refSpaces,responseSpaces,mBreakEval);

The first two lines get sets of Integer indices of token-final characters in the reference and the response, with indices counted in the string with spaces removed. For example:

getSpaces("XXX X XXXX XX") = { 2, 3, 7, 9 }
getSpaces("XXXXXX XX XX") = { 5, 7, 9}

The call to the prEval method in the third line adds the number of true positives, false positives and false negatives to the break evaluation. Here's the method:

<E> void prEval(String evalName, Set<E> refSet, Set<E> responseSet,
                PrecisionRecallEvaluation eval) {
    for (E e : refSet)
        eval.addCase(true,responseSet.contains(e));

    for (E e : responseSet)
        if (!refSet.contains(e))
            eval.addCase(false,true);
}

This first loops over the reference cases, testing whether or not the case is in the response set. It either calls eval.addCase(true,true), adding a true positive case appearing in the reference and response, or it calls eval.addCase(true,false), adding a false negative case appearing in the reference but not the response. The last loop is through the response set, and it adds a case eval.addCase(false,true) for a false positive for a case that is in the response set but not in the reference set.

At the end of the run, the precision-recall evaluation object can be queried for the precision, recall and f-measure (among other statistics):

System.out.println("  EndPoint:"
                   + " P=" + mBreakEval.precision()
                   + " R=" + mBreakEval.recall()
                   + " F=" + mBreakEval.fMeasure());

This evaluation result tends to be much higher than the chunk evaluation. The reason is that a single misplaced boundary makes both of the adjoining chunks wrong, so chunk mismatches lead to multiple false positives and false negatives.

Chunk Evaluation

The evaluation of chunking proceeds in the same way:

    Set<Tuple<Integer>> refChunks = getChunks(reference,mReferenceLengthHistogram);
    Set<Tuple<Integer>> responseChunks = getChunks(response,mResponseLengthHistogram);
    prEval("Chunks",refChunks,responseChunks,mChunkEval);

The method to extract the chunks is a little trickier because it also computes the histogram of token lengths for the reference and response, as seen above in the method calls:

static Set<Tuple<Integer>> getChunks(String xs, ObjectToCounter<Integer> lengthCounter) {
    Set<Tuple<Integer>> chunkSet = new HashSet<Tuple<Integer>>();
    String[] chunks = xs.split(" ");
    int index = 0;
    for (int i = 0; i < chunks.length; ++i) {
        int len = chunks[i].length();
        Tuple<Integer> chunk = Tuple.create(new Integer(index),
                                            new Integer(index+len));
        chunkSet.add(chunk);
        index += len;
        lengthCounter.increment(new Integer(len));
    }
    return chunkSet;
}

Here we just split the original input on single spaces, and then add a tuple (an ordered pair of Integers) to the return set for each chunk, with values given by the chunk's start index and the index one past its end. For instance:

ref = "XXX X XXXX XX"
resp = "XXXXXX XX XX"

getChunks(ref) = { (0,3), (3,4), (4,8), (8,10) }
getChunks(resp) = { (0,6), (6,8), (8,10) }

In this case, there is one true positive, (8,10), three false negatives, (0,3), (3,4), and (4,8), and two false positives, (0,6) and (6,8).

Note that the index variable keeps the index into the original character sequence without spaces.

Token Length Histogram

Finally, note the increment of the length counter, which accumulates the histogram of token lengths. This histogram is printed at the end of the run using the following code:

System.out.println("Token Length, #REF, #RESP, Diff");
for (int i = 1; i < 10; ++i) {
    Integer iObj = new Integer(i);
    int refCount = mReferenceLengthHistogram.getCount(iObj);
    int respCount = mResponseLengthHistogram.getCount(iObj);
    int diff = respCount-refCount;
    System.out.println("    " + i + ", " + refCount
                       + ", " + respCount + ", " + diff);
}

This prints the reference counts, response counts, and the error expressed as a difference.

A Statistical Tokenizer Factory

The demo up to this point has just been concerned with an in-memory evaluation. The file src/StatisticalTokenizerFactory.java contains a simple implementation of a tokenizer factory based on a compiled spell checker. The implementation is simple, but not very efficient, because of its reliance on the regular-expression based tokenizer factory. The code is only a few lines:

public class StatisticalTokenizerFactory extends RegexTokenizerFactory {
    private final CompiledSpellChecker mSpellChecker;

    public StatisticalTokenizerFactory(CompiledSpellChecker spellChecker) {
        super("\\s+"); // break on spaces
        mSpellChecker = spellChecker;
    }

    public Tokenizer tokenizer(char[] cs, int start, int length) {
        String input = new String(cs,start,length);
        String output = mSpellChecker.didYouMean(input);
        char[] csOut = output.toCharArray();
        return super.tokenizer(csOut,0,csOut.length);
    }
}

It holds a compiled spell checker in a member variable that's assigned in the constructor. The class extends RegexTokenizerFactory, and the call super("\\s+") in the constructor tells the parent to construct tokens by breaking on non-empty sequences of whitespaces. The actual tokenizer just converts the input to a string, runs the spell checker on it, converts the output to a character array, and returns the result of the parent tokenizer factory. This result is a tokenizer that separates on the spaces inserted by the spell checker as a part of the output.
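
Hypothetical usage, assuming a spell checker trained and compiled as in the demo:

CompiledSpellChecker spellChecker = ...;        // trained and compiled as above
TokenizerFactory factory = new StatisticalTokenizerFactory(spellChecker);
char[] cs = unsegmentedText.toCharArray();      // unsegmentedText: Chinese text with no spaces
Tokenizer tokenizer = factory.tokenizer(cs,0,cs.length);
String token;
while ((token = tokenizer.nextToken()) != null)
    System.out.println(token);                  // one segmented word per line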

The character offsets in the tokenizer refer to positions in the spell checker's output rather than the original input; this could be changed by a tighter implementation of a statistical tokenizer factory that also avoided regular expressions by breaking directly on whitespace. The output is guaranteed to contain only single spaces.

A word of warning is in order about using this tokenizer for tasks like information retrieval. Because it relies on statistical context, the same sequence of characters might not always be tokenized the same way. This can have dire consequences in tasks such as information retrieval if a query and corpus have different tokenizations.

Tuning Statistical Tokenizers

There are a number of performance tuning options that control both speed and accuracy.

N-best Size

The most important speed tuning factor is the size of the n-best list. This should be as small as possible without causing too many search errors.

Pruning Language Models

With large training data sets, the models get very large. The character language models underlying the spell checker may be pruned just as other language models are.

Language Model n-gram

The most significant tuning parameter affecting both accuracy and speed is the size of the n-grams stored in the source language model. Five seems to be a good setting for this parameter: longer n-grams are not more accurate, and shorter ones are less accurate. Shorter n-grams do produce smaller model files, which can substantially reduce run-time memory consumption.

Language Model Interpolation

The interpolation parameter in the language model affects the degree to which longer contexts are weighted against shorter contexts during language model interpolation. This number is just a parameter in the Witten-Bell smoothing formula, which also considers the number of possible outcomes and the number of instances seen. In general, the lower this value, the less smoothing; with less smoothing, the training corpus dominates the statistics. With a higher value there is more smoothing, and more weight is given to possibilities that were not seen in the training data.
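
Roughly (see the language modeling tutorial and the NGramProcessLM class documentation for the exact formulation), the estimate for a character given a context interpolates the maximum-likelihood estimate in that context with the estimate from the next-shorter context:

P(c | context) = lambda * P_ml(c | context)  +  (1 - lambda) * P(c | shorter context)

lambda = count(context) / (count(context) + lambdaFactor * numOutcomes(context))

Here count(context) is the number of training events observed in the context and numOutcomes(context) is the number of distinct characters observed following it, so larger values of the lambda factor shift more weight onto the shorter, smoother contexts.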

Edit Weights

It is tempting to try to tune the edit weights. By making space insertion more costly than 0.0, breaks become relatively more expensive than continuing (matching), which favors longer tokens. Similarly, by making matching more costly than 0.0, breaks become relatively less expensive than continuing, which favors shorter tokens. Such weights are fairly easy to implement by following the pattern provided by spell.CompiledSpellChecker.TOKENIZING.

As an example, we have added such a generalized implementation as a nested class called ChineseTokenizing in the demo (sketched above). The demo is configured so that the continue (match) and break (insert) weights may be set with the last two command-line arguments.
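
For instance, to penalize breaks slightly (favoring longer tokens) while leaving continues free, one could try an invocation like the following, where the -2.0 break weight is purely illustrative:

java -cp "../../../lingpipe-3.9.3.jar;zhToksDemo.jar" ChineseTokens dataDir cityu hk cityu.out Big5_HKSCS 5 5.0 5000 256 0.0 -2.0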

Unfortunately, with the default zero weights the system tends to overproduce one-character tokens, underproduce two- and three-character tokens, and then overproduce tokens longer than three characters, so no single adjustment of the break weight fixes the length distribution. A less naive length model might help here, but such a model is tricky to integrate with the decoder as it stands.

Another issue arguing against large changes to the edit weights is that endpoint precision and recall are roughly balanced. Increasing the insert (break) cost would lower endpoint recall, even if precision increased. Similarly, increasing the match (continue) cost would likely raise endpoint recall at the cost of precision.

Dictionary Training

Given a dictionary, its tokens may be added (each followed by a single space) as training data, just like lines from the training corpus, using the method handle(String). The normalization should be the same as for the other lines: internal whitespace reduced to single spaces, no initial space, and a single final space.
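
Here is a minimal sketch, assuming a hypothetical plain-text dictionary file with one entry per line in the corpus character encoding:

BufferedReader dictReader = new BufferedReader(
    new InputStreamReader(new FileInputStream(dictFile), mCharEncoding));
String entry;
while ((entry = dictReader.readLine()) != null) {
    String normalized = entry.trim().replaceAll("\\s+"," ");
    if (normalized.length() > 0)
        trainer.handle(normalized + " ");   // no initial space, single final space
}
dictReader.close();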

E-mail us with Better Settings

If you find settings that work better than ours, please let us know at lingpipe@alias-i.com.

SigHan 2005 Bakeoff

A week after writing the 2003 SigHan demo, the Second International Chinese Word Segmentation Bakeoff was held. The organizers again distributed the data for research purposes after the bakeoff. This section describes running LingPipe on that data.

Segmentation Standards

The segmentation standards for the four groups are linked from the following table.

Corpus Creator            | Word Segmentation Standard
Academia Sinica           | Segmentation Standard (pdf)
City University Hong Kong | Segmentation Standard (pdf)
Peking University         | Segmentation Standard (pdf)
Microsoft Research        | Segmentation Standard (pdf)

Downloading Data

The data's available as a single .zip file:

icwb2-data.zip [50MB]

This time, the organizers provided UTF-8 transcodings of the input files. Our code runs straight off the zip, so you don't even need to unpack it.

The zip file contains the following corpora:

2005 SigHan Bakeoff Data Zip File
Creator          |               Train                  |                 Test
                 | Sentences | Uniq Words | Uniq Chars  | Sentences | Uniq Unknown Words | Uniq Unknown Chars
Academia Sinica  | 708,953   | 141,338    | 6115        | 14,432    | 3227               | 85
Microsoft        | 86,924    | 88,119     | 5167        | 3985      | 1991               | 12
HK City Uni      | 54,019    | 69,085     | 4923        | 1493      | 1670               | 60
Peking Uni       | 19,056    | 55,303     | 4698        | 1945      | 2863               | 91

Source Code

The source code to run the 2005 examples is in:

src/ChineseTokens05.java

It only differs from the earlier code in the way it constructs input streams from which to read the training and test data.

Running the Tests

There's an Ant task for each corpus in the build file. The 2005 tasks are distinguished from the 2003 ones by the suffix 05.

The Results

The following table presents LingPipe's results, obtained by training character 5-grams. These results would have put us in the "closed" category of the competition, meaning that the only linguistic information used to build the system was the training data (e.g. no dictionaries, no heuristic morphology, no POS taggers trained on other corpora).

2005 SigHan Bakeoff Chunk-Level Scoring
Corpus           |  Default LingPipe Results  |  Winning Bakeoff Result
                 |  Prec   Rec    F           |  Prec   Rec    F      Closed  Winning Site
Academia Sinica  |  0.956  0.979  0.968       |  0.951  0.952  0.952  Yes     Nara Inst
                 |                            |  0.950  0.962  0.956  No      Nat Uni Singapore
Microsoft Res    |  0.962  0.967  0.965       |  0.966  0.962  0.964  Yes     Stanford
                 |                            |  0.965  0.980  0.972  No      Harbin Inst
HK City Uni      |  0.927  0.928  0.928       |  0.946  0.941  0.943  Yes     Stanford
                 |                            |  0.956  0.967  0.962  No      Nat Uni Singapore
Peking Uni       |  0.935  0.925  0.930       |  0.946  0.953  0.950  Yes     Yahoo
                 |                            |  0.969  0.968  0.969  No      Nat Uni Singapore

These results show a substantial amount of variation across corpora. Because most systems were applied to most corpora, this also represents a very diverse range of "best" approaches. With more data, the statistical confidence intervals are much smaller, especially for the larger corpora.

This was a nice bakeoff in that many of the cooks can add to their trophy cases. The best overall system was Wei Jang's open entry for the Harbin Institute on the Microsoft corpus, with an F-measure of 0.972 (which also represents a large error reduction over Jang's own closed submission for that corpus). Hwee Tou Ng's team from the National University of Singapore swept the open category for the three other corpora. Huihsin Tseng, a U. Colorado student, made an excellent showing as well, taking two of the closed categories while playing for Stanford.

LingPipe would've placed first in the closed category for two of the corpora: Academia Sinica and Microsoft Research. Perhaps not coincidentally, these are the two largest corpora. Surprisingly, LingPipe's closed results for the AS corpus are better than the best open results submitted to the bakeoff. We wonder if some of the other systems may have been confused by the mixture of full-width ideographic spaces (U+3000) and regular ASCII spaces (U+0020) in the AS corpus; it required us to generalize our inter-token whitespace regular expression to "(\\s|\u3000)+".
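
Presumably the whitespace normalization from the 2003 code then becomes something along these lines:

String normalizedLine = trimmedLine.replaceAll("(\\s|\u3000)+"," ");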

Official Results

The official results page is:

SigHan 2005 Official Results

References