What is Hyphenation and Syllabification?

The simplest way to understand hyphenation and syllabification is through some examples.

Some Examples

Here are three examples taken from dictionary.reference.com, which federates several popular dictionaries, including its own. In the following table, the top row gives the word, with its hyphenation and syllabification below it:

                 side     landed        publisher
Hyphenation      side     land - ed     pub - lish - er
Syllabification  sɪd      læn + dɪd     pʌb + lɪ + ʃər

Hyphenation

The hyphenation process inserts hyphens at various points in the word itself. These hyphenations are used primarily in typesetting, where words may be broken over a line and a hyphen inserted at a legal hyphenation point. Thus hyphenation is an orthographic process, meaning it works over the spellings of words in a language.

Syllabification

Syllables, on the other hand, break the sequence of spoken sounds into units. The sounds of a language are here written in the International Phonetic Alphabet (IPA), a widely used system for writing down the distinctive sounds of the world's languages. The number of syllables in a word roughly corresponds to the number of spoken vowel clusters. For instance, the word "side" has two vowels in its spelling, but only the first is pronounced, so it has only a single syllable.

Boundary Ambiguity

Exactly where the boundaries go for hyphenation and syllabification is another matter. There are no "official" definitions of syllables or hyphens. Hyphenation may vary by dictionary or publishing standard (e.g. The Chicago Manual of Style). Hyphenation tends to follow both the word form and the syllabification. But consider "landed", where the hyphenation and syllabification disagree about where the first "d" should be. As is typical with syllabification, consonants tend to pile up at the front of syllables, whereas in hyphenation, words tend to get split at morphological boundaries (that is, between stems and affixes in English). Some linguists admit so-called ambisyllabicity, which involves sounds at boundaries that belong to two syllables at once.

Words vs. Usages

It's not actually possible to solve the hyphenation problem for words, per se, because hyphenation really depends on the form of the word used. Consider the word "invalid", which when used as a noun is hyphenated in-va-lid and when used as an adjective is hyphenated in-val-id. Note that there's lexical stress involved at the pronunciation level, with the noun stressing the first syllable and the adjective the second.

For the purposes of this tutorial, we'll just fix a dictionary and take its definitions as the reference standard, and then build systems that can recreate the dictionary's decisions.

How is it Done?

Rules and Taggers

There are several approaches to hyphenation and syllabification in the literature, ranging from traditional rule-based transducers to k-nearest neighbors approaches up through structured support vector machines.

Cast as a tagging problem, each position between letters needs to be assigned as either a break point or a non-break point. A discriminative tagger should, in principle, produce the best results, as it can use all of the information we include in the noisy channel model while also modeling the boundaries discriminatively. Having said that, the model we present here is state-of-the-art in terms of accuracy.

The Noisy Channel Model

Our approach follows that of our Chinese word tokenization, which in turn is based on spelling correction. Specifically, we use what's known as a noisy channel model. A noisy channel model has two components, a source and a channel. The source represents a probabilistic message generator, in this case a generator of hyphenated words. The channel represents noise in transmission. In our case, we use a deterministic channel that simply deletes the hyphenations. The recipient's job is then to decode the original message with hyphens from the received message, which has no hyphens.

Given a source model p(h) for hyphenated words, and a channel model p(w|h) defined so that p(w|h) = 1 if w is equal to h with the hyphens removed and 0 otherwise, we seek the most likely source message h to have produced the observed message w:

ARGMAXh p(h|w) = ARGMAXh p(w|h) p(h) / p(w)
               = ARGMAXh p(w|h) p(h)         
               = ARGMAXh s.t. strip(h)=w p(h)

where we use strip(h) = w to mean that w is equal to h with the hyphenations stripped out (in Java terms, h.replaceAll(" ","").equals(w)). Thus with a deterministic channel, we wind up looking for the most likely hyphenation h according to p(h), restricting our search to h that produce w when the hyphens are stripped out.
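
To make the deterministic channel concrete, here's a minimal sketch of the constraint in Java; the method name channelAllows is ours and isn't part of the demo code:

static boolean channelAllows(String hyphenatedSource, String observedWord) {
    // p(w|h) = 1 exactly when stripping the hyphenation spaces from h yields w
    return hyphenatedSource.replaceAll(" ", "").equals(observedWord);
}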

The Moby Corpus

For our quick getting started example, we'll use a free corpus of English hyphenations.

Downloading the Corpus

Grady Ward produced a free corpus of English hyphenations, which is available from Project Gutenberg:

We'll assume you've downloaded the corpus into a directory $MOBY_RAW_DATA and unzipped it there (the data unpacks into a file called mhyph.txt).

> cd $MOBY_RAW_DATA
> unzip mhyph10.zip

Munging the Corpus

The corpus is encoded as bytes, with the byte sequence 0xFF 0xFD used to indicate hyphenation points. There's a simple program src/MobyHyphenCorpus.java which we wrote to munge the corpus into our standard input format, one hyphenated word per line with hyphenation points marked by spaces. It splits space-separated compounds into their own entries.
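
As a rough illustration of the decoding step, here's a hedged sketch based only on the description above, turning the marker byte pair into a space and treating the remaining bytes as Latin-1; the real MobyHyphenCorpus.java also splits compounds and writes one entry per line, and may differ in its details:

// sketch only; not the actual MobyHyphenCorpus.java
static String decodeMobyHyphens(String path) throws java.io.IOException {
    byte[] raw = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(path));
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < raw.length; ++i) {
        int b = raw[i] & 0xFF;
        if (b == 0xFF && i + 1 < raw.length && (raw[i + 1] & 0xFF) == 0xFD) {
            out.append(' ');       // hyphenation marker becomes a space
            ++i;                   // skip the second marker byte
        } else {
            out.append((char) b);  // other bytes treated as Latin-1
        }
    }
    return out.toString();
}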

The munger can be run through the ant target moby-hyphens (be sure to replace $MOBY_RAW_DATA as described above):

> ant -Dmoby.raw.file=$MOBY_RAW_DATA/mhyph.txt -Dhyphens.file=moby-hyphens.txt moby-hyphens

Raw Input=e:\data\moby-hyphens\unpacked\mhyph.txt
Hyphenation Output=moby-hyphens.txt
Found #input lines=187175
Found #hyphenations=173781

We don't discuss this corpus in much detail as it seems to be very noisy, including all sorts of non-English words and non-words with fairly inconsistent hyphenations. Here's a small sample of the resulting file moby-hyphens.txt:

...
a bid ing ness
a bid jan
a bide
a bie
a bil i ty
a bim e lech
a bin o am
a bin o em
...

Running Hyphenation Cross-Validation

We can then take this corpus and run it through our hyphenation cross-validation program (fully explained below), using the ant target xval-hyphens (it'll take a while to run without spitting out any feedback in non-verbose mode):

> ant -Dhyphens.file=moby-hyphens.txt -Dverbose=FALSE xval-hyphens

Reading data from file=C:\carp\mycvs\lingpipe\demos\tutorial\hyphenation\moby-hyphens.txt
     #words=173732
     #hyphens=396978
     #hyphen insertion points=1439791

ACCURACY ARGS: NGRAM=8 NUM_CHARS=64 INTERPOLATION_RATIO=4.0 HYPHENATION_N_BEST=1024 VERBOSE=false

OVERALL ACCURACIES ACROSS FOLDS

PER HYPHENATION: PREC=0.949 RECALL=0.950 F(1)=0.949

PER TAGGING DECISION: ACC=0.9721 #DECISIONS=1439791

WHOLE WORD: ACCURACY=0.877 DEV=0.002

We describe the code used for training and testing the model below, including a description of what all the evaluation numbers mean. We also discuss a couple other modes of operation other than the simple language-model-based noisy channel that can be used.

Webster's Unabridged Dictionary

A quirky encoding of the 1890 version of Webster's Unabridged Dictionary is also available from Project Gutenberg:

We'll assume you've downloaded and unzipped the distribution into the directory $MOBY_RAW_DATA.

Munging the Corpus

A program to extract the hyphenations may be found in WebstersHyphensCorpus.java. It's just a bunch of scraping code that's not worth discussing, other than to mention that we decompound compounds and hyphenated forms, and remove any residual entries containing non-ASCII characters.

The munging program may be run from the ant target websters-hyphens (be sure to replace $MOBY_RAW_DATA as described above):

ant -Dhyphens.file=websters-hyphens.txt -Dwebsters.raw.file=$MOBY_RAW_DATA/pgwht04.txt websters-hyphens

BAD: .
BAD: 1
BAD: ab b\'82
BAD: ac a le ph\'91
BAD: a ch\'91 an
BAD: ach ro \'94 dex trin
...
BAD: zo \'94 troph ic
BAD: zu &ntil;is
BAD: zyg o dac ty l\'91
# rejected entries=1433
# retained entries=95349
DONE.

It prints out the rejected entries, reporting that a total of 1,433 entries were rejected, while 95,349 were retained. A better version of this program would figure out which Latin-1 characters the numerical escapes refer to and do a replace. For instance, \'91 is clearly Unicode character U+00E6 (æ).
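
Here's a hypothetical sketch of such a replacement. Only the \'91-to-æ mapping is identified above; a complete fix would need the full code-page table for the remaining \'xx escapes:

static String replaceKnownEscapes(String entry) {
    // \'91 is the only escape identified above; it maps to U+00E6 (æ)
    return entry.replace("\\'91", "\u00E6");
}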

Evaluating Hyphenation

We can cross-validate on the Webster's dictionary data the same way as for the Moby corpus, with the xval-hyphens ant target:

> ant -Dhyphens.file=websters-hyphens.txt -Dverbose=FALSE xval-hyphens

Reading data from file=C:\carp\mycvs\lingpipe\demos\tutorial\hyphenation\websters-hyphens.txt
     #words=93750
     #hyphens=191118
     #hyphen insertion points=708320

ACCURACY ARGS: NGRAM=8 NUM_CHARS=64 INTERPOLATION_RATIO=4.0 HYPHENATION_N_BEST=1024 VERBOSE=false

NGRAM=  8 INTERP= 4.0 ACC=0.880 P=0.940 R=0.941 F=0.941

OVERALL ACCURACIES ACROSS FOLDS

PER HYPHENATION: PREC=0.940 RECALL=0.941 F(1)=0.941

PER TAGGING DECISION: ACC=0.9679 #DECISIONS= 708320

WHOLE WORD: ACCURACY=0.880 DEV=0.003

This doesn't look very good from the whole word perspective, though the per-hyphenation scores are OK. This may be due to the number of non-English words and names in Webster's dictionary.

> ant -Dhyphens.file=websters-hyphens.txt -Dverbose=TRUE xval-hyphens

Reading data from file=C:\carp\mycvs\lingpipe\demos\tutorial\hyphenation\websters-hyphens.txt

Rejecting word=fellowship
Rejecting word=griev'ance
Rejecting word=thermometer
Rejecting word=worthwhile
     #words=93750
     #hyphens=191118
     #hyphen insertion points=708320
ACCURACY ARGS: NGRAM=8 NUM_CHARS=64 INTERPOLATION_RATIO=4.0 HYPHENATION_N_BEST=1024 VERBOSE=true

EVALUATING FOLD=0/10
     Training Hyphenator
     Compiling Hyphenator
                REFERENCE                  RESPONSE CORRECT
             @duc ti ble@              @duct i ble@   false
                @deu sed@                  @deused@   false
            @ble ton ism@             @ble to nism@   false
                  @lil y@                    @lily@   false
              @dil u ent@               @di lu ent@   false
                 @sau te@                   @saute@   false
        @dem o ni a cism@         @de mo ni a cism@   false
                 @secant@                 @se cant@   false
             @in flat ed@              @in fla ted@   false
                @var den@                 @vard en@   false
                 @mon te@                   @monte@   false
          @je ron y mite@           @jer on y mite@   false
         @sec u rif e ra@          @se cu rif e ra@   false
             @bur let ta@              @burl et ta@   false
      @non med ul la ted@       @non me dul la ted@   false
                 @fe rie@                  @fer ie@   false
                @cist ic@                 @cis tic@   false
                  @piste@                  @pis te@   false
           @ruhm korff's@           @ruhm korff 's@   false

On the left is the reference value in the dictionary and on the right the system response. Note that four words were rejected because they were very long and had no hyphenation. This rejection heuristic obviously misses corpus errors like the word "secant" above, which should have at least one hyphenation point.

The CELEX2 Corpus

Rather than spidering and scraping a web site, for the purposes of this tutorial we use the CELEX2 corpus from the Linguistic Data Consortium. CELEX2 contains data for English, German and Dutch, including spelling, syllabification and hyphenation, created by a consortium of Dutch research institutes. The CELEX2 corpus is only US$300 for non-members, but even for LDC members, comes with a very restrictive research-only license.

We will assume CELEX2 has been downloaded and then unpacked from its gzipped tarball into a directory we'll call $CELEX2. CELEX2 provides various annotations, each in their own files. For English hyphenation, the relevant file is $CELEX2/english/eow/eow.cd. Each entry is on its own numbered line. Here's a sample:

...
147\abolish\27\78\1\B\27\78\a-bol-ish
148\abolished\31\78\1\B\31\125\a-bol-ished
149\abolishes\3\78\1\B\3\0\a-bol-ish-es
...

It's very easy to pull out the hyphenated versions of the words for our data set. There is a small complication in that both hyphenated words and compounds are allowed as inputs, marked in the output with double hyphens and spaces respectively:

...
75889\run-around\2\39595\1\B\2\0\run--a-round
75890\run around\0\39594\1\B\0\0\run a-round
...

To handle this issue, we'll just collect all the single words, splitting on double-hyphens and spaces.

The final complication is that there can be more than one annotation per line, as in:

...
235\abridgments\0\124\2\B\0\0\a-bridg-ments\abridgements\B\0\0\a-bridge-ments
...

In this case, there's an alternate spelling ("abridgments" vs. "abridgements"), each with its corresponding hyphenation.
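
Here's a hedged sketch of the extraction, assuming the column layout seen in the examples above (the primary hyphenation in the ninth backslash-separated field, with any alternate spellings following in blocks of five fields); the real CelexHyphenCorpus.java also handles diacritics and bookkeeping:

// sketch only; assumes the field layout shown in the examples above
static java.util.List<String> extractHyphenations(String line) {
    String[] fields = line.split("\\\\");
    java.util.List<String> hyphenations = new java.util.ArrayList<String>();
    for (int i = 8; i < fields.length; i += 5) {
        // compounds use spaces and hyphenated inputs use double hyphens; split on both
        for (String word : fields[i].split("--| ")) {
            hyphenations.add(word.replace('-', ' '));  // remaining hyphens mark break points
        }
    }
    return hyphenations;
}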

CELEX Guidelines for English, Dutch and German

The tagging guides for Dutch, English, and German may be found at the top-level directory for each language:

There's also an online description of the contents of the corpus.

Among other resources, these guides (at least the English one) contain a mapping between IPA symbols and the symbols used in CELEX for phonemes.

Creating the Input Data

Running the Extractor

We pull the input data out of the CELEX2 file through the ant target celex-hyphens, with input file and output file specified with a -D property declaration:

> ant -Dcelex.ow.file=e:\data\celex2\unpacked\celex2\english\eow\eow.cd -Dhyphens.file=celex-english-hyphens.txt celex-hyphens

# hyphenations=71476
# unique words=71363
[naafi, naaf i]
[de serts, des erts]
[coun ci lors, coun cil ors]
[reb el, re bel]
[cursed, curs ed]
...
[in valid, in val id, in va lid]
...
RESIDUAL ESCAPES=[]

BUILD SUCCESSFUL
Total time: 6 seconds

Multiple Hyphenations

Note that there are 71,476 unique hyphenations, but only 71,363 words from which they were drawn. The difference is made up by words with multiple hyphenations, which are listed below the counts (truncated here). The word "invalid" is unique in having three hyphenations, though the first one appears to be an error (unless that last syllable's pronounced like "vlid", but in general, reductions like that aren't listed in the corpus). The system we will produce will generate a first-best hyphenation for each word; under this scheme, the system will make at least 71,476 - 71,363 = 113 mistakes on the ambiguous English data.

Diacritics and Special Symbols

The CELEX2 corpus uses ASCII encodings of characters, employing their own custom diacritic encodings, listed in their README file(s) [this combines the Dutch, English and German accent sets, which are consistent]:

  diacritic symbol           diacritic sign
        #                           '         (acute accent)
        `                           `         (grave accent)
        "                           "         (diaeresis)
        ^                           ^         (circumflex accent)
        ,                           ,         (cedilla)
        ~                           ~         (tilde)
        @                           o         (ring)        

The German Eszett symbol (ß unicode U+00DF) is encoded by the dollar-sign character '$'.

Multiple Forms of the Same Word

There are also multiple entries and entries with apostrophes for possessives and contractions:

...
8378\boatswains\1\4518\4\B\0\0\boat-swains\bosuns\B\0\0\bo-suns\bo's'ns\B\0\0\bo's'ns\bo'suns\B\1\0\bo'suns
...

In this example, the word "boatswains" is hyphenated as boat-swains, and then additional varieties are supplied on the same line, such as "bosuns", "bo's'ns", and "bo'suns", each with their own hyphenation.

Commas in Phrasal Entries

There's an error in some of the English encodings, which have commas in the entries, even though commas are reserved for cedillas. We just clean up by breaking on comma plus space to split these into multiple words.

42215\hop, skip, and jump\0\21578\1\B\0\0\hop, skip, and jump
42216\hop, step, and jump\0\21579\1\B\0\0\hop, step, and jump

Finished Data

The finished product, the file for which was specified as celex-english-hyphens.txt in the ant target invocation above, is UTF-8 encoded, with one entry per line, with spaces representing hyphens:

'd
'em
'll
...
a
a back
a baft
a ban don
a ban don ing
...
é pée
é pées

There are remaining entries with apostrophes for contractions and possessives.

Hyphen Extractor Source Code

The source code for the extractor is in src/CelexHyphenCorpus.java. It's not particularly interesting, as it merely walks through the data stripping out pronunciations, breaking compounds into single words, collecting counts, and reinserting the diacritics.

Cross-Validating Hyphenation

Like our other evaluations, we set up hyphenation to run through ten-fold cross-validation. This means we will divide the data up into ten random sets, known as folds, and then for each fold in turn, train on the other nine folds and test on the given fold. Results get reported as averages.
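
As a minimal sketch of the fold assignment used throughout (the same modular test appears in the training and evaluation code below), item i in the permuted data serves as a test case for fold i % numFolds and as a training case for every other fold:

int numFolds = 10;
for (int fold = 0; fold < numFolds; ++fold) {
    for (int i = 0; i < hyphenatedWords.length; ++i) {
        if (i % numFolds == fold) {
            // test case for this fold
        } else {
            // training case for this fold
        }
    }
}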

> ant -Dverbose=false -Dhyphens.file=celex-english-hyphens.txt xval-hyphens

ACCURACY ARGS: NGRAM=8 NUM_CHARS=64 INTERPOLATION_RATIO=4.0 HYPHENATION_N_BEST=1024 VERBOSE=false

Reading data from file=C:\carp\mycvs\lingpipe\demos\tutorial\hyphenation\celex-english-hyphens.txt
     #words=71476
     #hyphens=121794
     #hyphen insertion points=524910

FOLD = 0 ACCURACY=0.955
FOLD = 1 ACCURACY=0.957
FOLD = 2 ACCURACY=0.957
FOLD = 3 ACCURACY=0.958
FOLD = 4 ACCURACY=0.955
FOLD = 5 ACCURACY=0.956
FOLD = 6 ACCURACY=0.960
FOLD = 7 ACCURACY=0.952
FOLD = 8 ACCURACY=0.958
FOLD = 9 ACCURACY=0.956

OVERALL ACCURACIES ACROSS FOLDS

PER HYPHENATION: PREC=0.975 RECALL=0.976 F(1)=0.975

PER TAGGING DECISION: ACC=0.9886 #DECISIONS= 524910

WHOLE WORD: ACCURACY=0.956 DEV=0.002

Evaluations

Before going into the code, we'll describe the three distinct lines of the evaluations.

Whole Word Accuracy

The simplest evaluation is whole word accuracy, which is simply the number of words that were hyphenated correctly divided by the number of words. The whole word accuracy for this run was 0.956, with a deviation of 0.002 derived from the ten runs, which varied in accuracy between 0.952 and 0.960.

It's instructive to compute the binomial confidence interval (a kind of Z-test, which is appropriate here due to large sample size), which is based on a standard deviation calculation, where p is the success probability (here 0.956) and N is the number of trials (71,476):

dev(Binomial(p,N)) = sqrt(p * (1-p) / N)

Note that the result is normalized (by dividing by N) back to the probability scale. The actual deviation in the binomial outcome is sqrt(N * p * (1-p)), which is then divided by N to yield sqrt(p * (1-p) / N).

Plugging in our numbers yields 0.000767. A 95% confidence interval is roughly two times the deviation in either direction, or, +/- .00153, for a confidence interval of (0.9545, 0.9575), which appears to be a bit too narrow given the cross-validation results (see the sidebar).
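
Here's the same arithmetic as a quick sketch:

double p = 0.956;            // whole-word accuracy
double n = 71476.0;          // number of words evaluated
double dev = Math.sqrt(p * (1.0 - p) / n);            // ~0.000767
double halfWidth = 1.96 * dev;                        // ~0.0015
System.out.printf("dev=%.6f 95%% CI=(%.4f, %.4f)%n",
                  dev, p - halfWidth, p + halfWidth); // ~(0.9545, 0.9575)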

Per-Hyphenation Precision and Recall

Treating the hyphenations as objects to be found, we can report precision and recall values. There's a total of 121,794 hyphens in the corpus. As usual, precision is the percentage of hyphens the model returns that are correct, whereas recall is the percentage of the corpus's hyphens that the model finds. The numbers here are precision=0.975 and recall=0.976. That is, we found 97.6% of the hyphens and 97.5% of what we found was correct. The F-measure is their harmonic mean, which is 0.975.
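
As a quick check, the balanced F-measure is computed as the harmonic mean of precision and recall:

double prec = 0.975;
double recall = 0.976;
double f1 = 2.0 * prec * recall / (prec + recall);   // ~0.975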

Per-Decision Accuracy

Finally, we can look at each position between a pair of characters in the input to judge whether we made the right decision, hyphen or no-hyphen. There's a total of 524,910 such decision points, and our overall accuracy is 0.9886.

Verbose Reports Per Word

Reports on all the errors made by the system on a word-by-word basis are available by setting the verbose property to true, which it is by default. Here's a sample:

> ant -Dhyphens.file=celex-english-hyphens.txt xval-hyphens

....
EVALUATING FOLD=0/10
     Training Hyphenator
     Compiling Hyphenator
                REFERENCE                  RESPONSE CORRECT
                @lla mas@                @ll a mas@   false
               @py ja ma@                @py jam a@   false
           @fril li ness@            @frill i ness@   false
                 @trag i@                  @tra gi@   false
               @ser aphs@                @se raphs@   false
          @he do nis tic@           @he don is tic@   false
          @mo ral i ties@           @mor al i ties@   false
              @frig ates@               @fri gates@   false
             @cha ris ma@              @char is ma@   false
...
           @step fathers@           @step fa thers@   false
...
   @dis in fla tion a ry@    @dis in fla tion ar y@   false
...
            @vaude ville@            @vau dev ille@   false

The labeled data's hyphenation is listed on the left under REFERENCE, with the system's answer listed on the right under RESPONSE.

Most of the mistakes involve off-by-one break locations, but some of these lead to the wrong number of syllables, as for "llamas" and "vaudeville".

Some of the mistakes appear to be errors in the training data, such as "step-fathers", which clearly should have a hyphenation point in "fathers". Other errors are questionable, such as "dis-in-fla-tion-a-ry", whose hyphenation in dictionary.reference.com matches our response. In part, this indicates how sketchy the "rules" are for hyphenation in English.

Code Walkthrough

The cross-validation code is available in src/XvalHyphenation.java. The portions of the code that train, compile, and run on new data are very simple. Most of the code's tied up in creating the folds and running evaluations.

Spell Checking

Hyphenation is based on our spell checking API, which is further explained in the spell check tutorial and in the Chinese tokenization tutorial.

Problem Encoding

To cast the problem into our spelling framework, we wrap each word in a start-word and end-word symbol, both chosen to be '@' here. Using different symbols for start and end doesn't change our results, so we went with the simpler encoding. We then just train the model by feeding it these wrapped words.

Corpus Parsing

There's code to parse the corpus in method parseCorpus(). The only piece of this relevant to the API is the piece that appends the start-of-word and end-of-word symbols:

public static String[] parseCorpus(File hyphenationDataFile) throws IOException {
    ...
    String[] hyphenatedWords = new String[hyphenatedWordList.size()];
    for (int i = 0; i < hyphenatedWords.length; ++i)
        hyphenatedWords[i] = "@" + hyphenatedWordList.get(i) + "@";

and the piece that permutes the array of hyphenated words (using our utility class util.Arrays):

    com.aliasi.util.Arrays.<String>permute(hyphenatedWords, new Random(42));

The random seed is hard-coded here so that the results are reproducible from run to run. This is a re-usable trick for debugging methods with a random component. Changing the seed changes the results, because the random permutation determines which items fall into each fold.

Model Training

Training the spell checking model is done in the usual way:

static CompiledSpellChecker compileHyphenator(String[] hyphenatedWords, 
                                              int fold, int numFolds,
                                              int maxNGram, int numChars, 
                                              double interpolationRatio)
    throws IOException, ClassNotFoundException {

    NGramProcessLM lm 
        = new NGramProcessLM(maxNGram,numChars,interpolationRatio);
    WeightedEditDistance distance
        = CompiledSpellChecker.TOKENIZING;
    TrainSpellChecker trainer 
        = new TrainSpellChecker(lm,distance,null);
    for (int i = 0; i < hyphenatedWords.length; ++i)
        if (i % numFolds != fold)
            trainer.handle(hyphenatedWords[i]);
    ...

First, a process n-gram language model is created using the specified parameters for n-gram length, interpolation ratio and number of characters for the baseline uniform distribution.

Next, the weighted edit distance is set to the constant CompiledSpellChecker.TOKENIZING, which implements the deterministic channel model by making space insertion and matching cost zero and all other edits cost infinity.

Then, the model is compiled using the utility method compile() from util.AbstractExternalizable:

        CompiledSpellChecker hyphenator
            = (CompiledSpellChecker) AbstractExternalizable.compile(trainer);
        hyphenator.setAllowInsert(true);
        hyphenator.setAllowMatch(true);
        hyphenator.setAllowDelete(false);
        hyphenator.setAllowSubstitute(false);
        hyphenator.setAllowTranspose(false);
        hyphenator.setNumConsecutiveInsertionsAllowed(1);
        hyphenator.setFirstCharEditCost(0);
        hyphenator.setSecondCharEditCost(0);
        hyphenator.setNBest(HYPHENATION_N_BEST);

        return hyphenator;

After the model is compiled, it is configured for efficiency using methods in CompiledSpellChecker. In particular, it allows inserts (for the spaces), but limits the number of consecutive insertions to one. The code allows matches, but all other edits are disallowed.

This code also sets the first and second character edit cost penalties to 0, putting them on an equal footing with other characters. For ordinary spelling correction, there is a much lower chance of first and second characters being wrong than characters later in the word.

Finally, the n-best list is set to a constant. In general, the n-best list should be set as low as possible without hurting accuracy, because fatter n-best lists mean slower parsing time.
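
Once configured, the compiled hyphenator is applied to unhyphenated words wrapped in the boundary symbol, exactly as in the evaluation code below. Here's a hypothetical usage sketch; the input word is ours, and the hyphenation in the comment is the corpus's answer, which the model should ideally reproduce:

String word = "abandoning";                          // hypothetical input
String hyphenated = hyphenator.didYouMean("@" + word + "@");
String cleaned = hyphenated.replaceAll("@", "").trim();
System.out.println(cleaned);                         // ideally "a ban don ing"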

Compiling to File

The trained model could have been compiled to a file using the method compileTo(ObjectOutput) on the spell checker trainer or the utility method AbstractExternalizable.compileTo(Compilable,File), because TrainSpellChecker implements util.Compilable.

After compilation to a file, the model may be read back in through ordinary serialization (that is, an ObjectInput), or using the utility method AbstractExternalizable.readObject(File), cast to a CompiledSpellChecker. Don't forget to run the configuration on the compiled spell checker following the code above.
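
Here's a hedged sketch of that round trip, assuming a hypothetical file name, a trainer built as in compileHyphenator() above, and the same configuration constants:

java.io.File modelFile = new java.io.File("hyphenator.model");  // hypothetical path
AbstractExternalizable.compileTo(trainer, modelFile);

// later, possibly in another JVM
CompiledSpellChecker hyphenator
    = (CompiledSpellChecker) AbstractExternalizable.readObject(modelFile);
hyphenator.setAllowInsert(true);
hyphenator.setAllowMatch(true);
hyphenator.setAllowDelete(false);
hyphenator.setAllowSubstitute(false);
hyphenator.setAllowTranspose(false);
hyphenator.setNumConsecutiveInsertionsAllowed(1);
hyphenator.setFirstCharEditCost(0);
hyphenator.setSecondCharEditCost(0);
hyphenator.setNBest(HYPHENATION_N_BEST);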

Evaluation

The evaluation happens in the method evaluateFold(), which is called for each fold value (from 0 to the number of folds):

    static void evaluateFold(String[] hyphenatedWords, int fold, int numFolds, double[] accuracies,
                             PrecisionRecallEvaluation prEval) 
        throws ClassNotFoundException, IOException {

        CompiledSpellChecker hyphenator = compileHyphenator(hyphenatedWords,fold,numFolds,
                                                            MAX_NGRAM,NUM_CHARS,INTERPOLATION_RATIO);

        ...

Note that a precision-recall evaluation is passed in, so that it may be re-used across all the folds to provide a cumulative answer for the whole evaluation. The array accuracies is filled with per-fold accuracies so that means and deviations may be reported once all the folds are run.

The first call is just to compile the hyphenator. Then it loops over the word array, evaluating on a case-by-case basis:

    int numCases = 0;
    int numCorrect = 0;
    for (int i = 0; i < hyphenatedWords.length; ++i) {
        if (i % numFolds == fold) {
            String hyphenatedWord = hyphenatedWords[i];
            String unhyphenatedWord = hyphenatedWord.replaceAll(" ","");
            String rehyphenatedWord = tamp(hyphenator.didYouMean(unhyphenatedWord));
            boolean correct = hyphenatedWord.equals(rehyphenatedWord);
            ++numCases;
            if (correct)
                ++numCorrect;
            updatePrEval(prEval,hyphenatedWord,rehyphenatedWord);
        }
    }
    double accuracy = numCorrect / (double) numCases;
    accuracies[fold] = accuracy;

We keep track of overall accuracy using counters, and evaluate on the fold at hand (determined by the modular arithmetic test). The original hyphenated word is retrieved, and the spaces removed using a replace operation. Then it is fed to the didYouMean() method of the hyphenator (named after its use for spelling correction). The answer is correct if it's equal to the reference hyphenation, and counts are updated accordingly. Finally, the precision-recall evaluation is updated given the reference hyphenation from the corpus and the system's response hyphenation. Outside of the loop over cases, overall accuracy is recorded in the array passed in.

The method tamp() applied to the hyphenation gets rid of any spurious hyphens between the start and end boundaries and the word:

static String tamp(String hyphenation) {
    return "@" + hyphenation.replaceAll("@","").trim() + "@";
}

Precision-Recall Evaluation

The precision-recall eval uses the LingPipe class classify.PrecisionRecallEvaluation, which is a general confusion-matrix based scorer.

static void updatePrEval(PrecisionRecallEvaluation prEval, String reference, String response) {
    Set<Integer> referenceBoundarySet = getBoundarySet(reference);
    Set<Integer> responseBoundarySet = getBoundarySet(response);

    Set<Integer> universalSet = new HashSet<Integer>();
    universalSet.addAll(referenceBoundarySet);
    universalSet.addAll(responseBoundarySet);
    for (Integer i : universalSet) {
        boolean ref = referenceBoundarySet.contains(i);
        boolean resp = responseBoundarySet.contains(i);
        prEval.addCase(ref,resp);
    }
}

The method getBoundarySet(String) returns a set of integers giving, for each hyphenation point, the number of non-space characters preceding it (equivalently, the index in the unhyphenated word at which a hyphen would be inserted). For instance, foo bar ly would return the set {3, 6}. We then create the union of the reference and response boundaries, then iterate over these, calculating for each whether it's a reference and/or response tagging. These are added as cases to the precision-recall eval. For instance, adding true,true is a true positive, true,false is a false negative, and false,true a false positive.

static Set<Integer> getBoundarySet(String hyphenatedWord) {
    Set<Integer> boundarySet = new HashSet<Integer>();
    int pos = 0;
    for (int i = 0; i < hyphenatedWord.length(); ++i) {
        if (hyphenatedWord.charAt(i) == ' ')
            boundarySet.add(pos);
        else
            ++pos;
    }
    return boundarySet;
}

The rest of the code is just setup and print statements. Here's how the results get reported given the precision-recall eval, the number of tagging decisions (there's a method to compute this from the corpus), and the per-fold accuracies with deviations:

public static void report(PrecisionRecallEvaluation prEval, int numTagDecisions,
                          double[] accuracies) {

    System.out.printf("\nPER HYPHENATION: PREC=%5.3f RECALL=%5.3f F(1)=%5.3f\n",
                       prEval.precision(),
                       prEval.recall(),
                       prEval.fMeasure());

    double numTagErrors = prEval.falsePositive() + prEval.falseNegative();
    double perTagAccuracy = (numTagDecisions - numTagErrors) / (double) numTagDecisions;
        
    System.out.printf("\nPER TAGGING DECISION: ACC=%5.4f #DECISIONS=%7d\n",
                       perTagAccuracy,numTagDecisions);

    System.out.printf("\nWHOLE WORD: ACCURACY=%5.3f DEV=%5.3f\n",
                      Statistics.mean(accuracies),
                      Statistics.standardDeviation(accuracies));

}

An Interpolated Bidirectional Model

Like many left-to-right sequence models, the language-model based noisy channel model suffers from the label bias problem.

Without resorting to conditional random fields or structured support vector machines, we can mitigate the label bias problem somewhat by running models in both directions. That is, we train once with the usual data, then we train a second model with all of the words reversed. To produce a single answer, we combine the probability estimates of the forward and backward models using the n-best lists from the spell checkers. We provide a model that does this in src/XvalHyphenationBidi.java.
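
The reverse() helper used below isn't shown in the excerpts; a minimal sketch (the actual XvalHyphenationBidi.java may differ) just reverses the characters, which leaves the '@' boundary symbols in place at both ends:

static String reverse(String s) {
    return new StringBuilder(s).reverse().toString();
}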

N-best Spell Checker Output

Other than a whole lot of bookkeeping to track when a string is forward and backward, the only interesting innovation here is in combining the n-best lists from the spell checkers. This is done with a helper method nBest() which converts the output to a mapping from hyphenations to scores:

static ObjectToDoubleMap<String> nBest(CompiledSpellChecker hyphenator, String word, boolean reverseIO) {
    Iterator<ScoredObject<String>> nBest = hyphenator.didYouMeanNBest(reverseIO ? reverse(word) : word);
    ObjectToDoubleMap<String> nBestMap = new ObjectToDoubleMap<String>();
    while (nBest.hasNext()) {
        ScoredObject<String> scoredHyphenation = nBest.next();
        String hyphenation = XvalHyphenation.tamp(scoredHyphenation.getObject());
        Double score = Math.pow(2.0,scoredHyphenation.score());
        String orderedHyphenation = reverseIO ? reverse(hyphenation) : hyphenation;
        Double currentScore = nBestMap.get(orderedHyphenation);
        if (currentScore != null) {
            // happens when two inputs tamp to same val
            score = currentScore + score;
        }
        nBestMap.set(orderedHyphenation,score);
    }   
    double totalProb = 0.0;
    for (String key : nBestMap.keySet()) {
        totalProb += nBestMap.get(key);
    }
    ObjectToDoubleMap<String> conditionalNBestMap = new ObjectToDoubleMap<String>();
    for (String key : nBestMap.keySet()) {
        double conditionalProb = nBestMap.get(key) / totalProb;
        conditionalNBestMap.set(key,conditionalProb);
    }
    return conditionalNBestMap;
}

This just walks over the scored object output of the n-best spell checker. Scored objects include an object, in this case a string representing a hyphenation, and a score, in this case a joint log (base 2) likelihood of the string and its hyphenation. In some cases, the result of tamping down the spaces around the boundaries leads to two or three entries on the n-best list that correspond to the same hyphenation. In these cases, we convert the scores to the linear scale and add the values for the duplicated hyphenations.

The last step of the algorithm sums the linear probabilities p(h,w) to get the marginal probability p(w) of the underlying string, then divides the linear probabilities by the marginal to get the conditional, p(h|w) = p(h,w)/p(w).

Combining Forward and Backward Results

The hyphenation method just adds the forward and backward estimates and divides by two to get an interpolated score:

p(h|w) = 0.5 * (pF(h|w) + pB(h|w))

Different interpolation ratios other than 0.5/0.5 might produce different results. For instance, at 1.0/0.0 it produces the forward results and at 0.0/1.0 the backward results.

static String hyphenate(String word,
                        CompiledSpellChecker hyphenator,
                        CompiledSpellChecker hyphenatorReversed) {
    String reversed = reverse(word);
    ObjectToDoubleMap<String> fwdNBest = nBest(hyphenator,word,false);
    ObjectToDoubleMap<String> revNBest = nBest(hyphenatorReversed,word,true);
    ObjectToDoubleMap<String> nBest = new ObjectToDoubleMap<String>();
    for (String hyphenation : fwdNBest.keySet()) {
        double fwdScore = fwdNBest.getValue(hyphenation);
        double revScore = revNBest.getValue(hyphenation);   
        nBest.set(hyphenation, 0.5 * fwdScore + 0.5 * revScore);
    }
    return nBest.keysOrderedByValueList().get(0);
}

Basically, this just takes the two values and interpolates them evenly, adding them to a map. The final return just takes the top scoring value from the list and returns it.

Bidirectional Results

Unfortunately, the bidirectional results look almost identical to the unidirectional forward results.

> ant -Dhyphens.file=celex-english-hyphens.txt xval-hyphens

ACCURACY ARGS: NGRAM=8 NUM_CHARS=64 INTERPOLATION_RATIO=4.0 HYPHENATION_N_BEST=256

Reading data from file=C:\carp\mycvs\lingpipe\demos\tutorial\hyphenation\celex-english-hyphens.txt
     #words=71476
     #hyphens=121794
     #hyphen insertion points=524910

FOLD = 0 ACCURACY=0.955
FOLD = 1 ACCURACY=0.958
FOLD = 2 ACCURACY=0.956
FOLD = 3 ACCURACY=0.959
FOLD = 4 ACCURACY=0.955
FOLD = 5 ACCURACY=0.956
FOLD = 6 ACCURACY=0.960
FOLD = 7 ACCURACY=0.953
FOLD = 8 ACCURACY=0.959
FOLD = 9 ACCURACY=0.955

OVERALL ACCURACIES ACROSS FOLDS

PER HYPHENATION: PREC=0.975 RECALL=0.975 F(1)=0.975

PER TAGGING DECISION: ACC=0.9885 #DECISIONS= 524910

WHOLE WORD: ACCURACY=0.957 DEV=0.002

The whole word accuracy is .001 better, the recall is .001 worse, and the per-tagging accuracy .0001 worse. This makes the bidirectional system statistically indistinguishable from the forward system.

An Intersecting Bidirectional Model

The final variant we consider is motivated by the fact that hyphenation is a precision-oriented problem. This is because of the use case, which is inserting hyphens in printed material. As long as there's a reasonable hyphenation point somewhere near the middle of long words, overall recall is less important. With that in mind, we will deliver a system that simply intersects the first-best hyphens in the forward and backward models.

The Intersection Code

Training works just as in the bidirectional interpolation model, but decoding works by intersecting (or unioning, depending on the command-line flag) the tags in the forward and backward first-best results.

The code is in src/XvalHyphenationBiDiIntersect.java. The only interesting difference from the previous code is that it takes the first-best hyphenations from the forward and backward models and intersects (or unions) them.

static String intersect(String hyphenationFwd, String hyphenationRev) {
    boolean matched = hyphenationFwd.equals(hyphenationRev);
    if (matched) 
        return hyphenationFwd;

    Set<Integer> hyphenSetFwd = XvalHyphenation.getBoundarySet(hyphenationFwd);
    Set<Integer> hyphenSetRev = XvalHyphenation.getBoundarySet(hyphenationRev);

    if (INTERSECT)
        hyphenSetFwd.retainAll(hyphenSetRev);
    else
        hyphenSetFwd.addAll(hyphenSetRev);
    int k = 0;
    int[] hyphenPositions = new int[hyphenSetFwd.size()];
    for (Integer hyphenPosition : hyphenSetFwd) {
        hyphenPositions[k++] = hyphenPosition;
    }
    Arrays.sort(hyphenPositions);
    String hyphenation = hyphenationFwd.replaceAll(" ",""); // strip hyphens
    for (int i = hyphenPositions.length; --i >= 0; ) {
        hyphenation = hyphenation.substring(0,hyphenPositions[i])
            + " " + hyphenation.substring(hyphenPositions[i]);
    }
    return hyphenation;
}

It's just a bunch of index fiddling. The boolean flag INTERSECT determines if the results are unioned or intersected. It's controlled by the property intersect from the command line.

Results from Intersection

The code can be run from the ant target xval-hyphens-bidi-int, with the verbosity and corpus specified on the command line, and additionally, a flag indicating whether to intersect or not (a false value yields union).

> ant -Dverbose=false -Dintersect=true -Dhyphens.file=celex-english-hyphens.txt xval-hyphens-bidi-int

Reading data from file=C:\carp\mycvs\lingpipe\demos\tutorial\hyphenation\celex-english-hyphens.txt
     #words=71476
     #hyphens=121794
     #hyphen insertion points=524910
FOLD = 0 ACCURACY=0.948
FOLD = 1 ACCURACY=0.949
FOLD = 2 ACCURACY=0.950
FOLD = 3 ACCURACY=0.952
FOLD = 4 ACCURACY=0.949
FOLD = 5 ACCURACY=0.948
FOLD = 6 ACCURACY=0.953
FOLD = 7 ACCURACY=0.944
FOLD = 8 ACCURACY=0.949
FOLD = 9 ACCURACY=0.947

OVERALL ACCURACIES ACROSS FOLDS

PER HYPHENATION: PREC=0.981 RECALL=0.969 F(1)=0.975

PER TAGGING DECISION: ACC=0.9884 #DECISIONS= 524910

WHOLE WORD: ACCURACY=0.949 DEV=0.002

These results are substantially different than the two previous sets (forward and interpolating), in that the precision is much higher at 0.981 versus 0.975. It effectively eliminates .006/.025 = 24% of the spurious hyphens suggested by the system. Whole word accuracy drops from 0.956 to 0.949, but the per-tagging accuracy remains roughly the same. Recall also drops from 0.976 to 0.969.

Running in unioning mode, we push the precision/recall balance in the opposite direction, which is not so useful other than as a theoretical exercise:

ACCURACY ARGS: NGRAM=8 NUM_CHARS=64 INTERPOLATION_RATIO=4.0 HYPHENATION_N_BEST=256 VERBOSE=false MODE=UNIONING

Reading data from file=C:\carp\mycvs\lingpipe\demos\tutorial\hyphenation\celex-english-hyphens.txt
     #words=71476
     #hyphens=121794
     #hyphen insertion points=524910
FOLD = 0 ACCURACY=0.945
FOLD = 1 ACCURACY=0.946
FOLD = 2 ACCURACY=0.949
FOLD = 3 ACCURACY=0.950
FOLD = 4 ACCURACY=0.948
FOLD = 5 ACCURACY=0.946
FOLD = 6 ACCURACY=0.951
FOLD = 7 ACCURACY=0.943
FOLD = 8 ACCURACY=0.949
FOLD = 9 ACCURACY=0.945

OVERALL ACCURACIES ACROSS FOLDS

PER HYPHENATION: PREC=0.968 RECALL=0.981 F(1)=0.974

PER TAGGING DECISION: ACC=0.9880 #DECISIONS= 524910

WHOLE WORD: ACCURACY=0.947 DEV=0.002

Here we have recall at 0.981, which is higher than our original 0.976, at the cost of a few points of precision. As with the precision-balanced system, overall accuracy and per-tagging accuracy also suffer compared to the simple forward or the forward/backward interpolating system.

Parameter Tuning

Like most of the models based on character language models, hyphenation is relatively insensitive to the available tuning parameters.

N-gram Length

First, let's look at n-gram length and interpolation ratio for some reasonable values. The following table contains reports for the simple forward model, evaluated over n-grams from 4 to 10 and interpolation ratios from roughly 1/16 of the n-gram length up to 4 times the n-gram length.

N   Interp   Acc     Prec    Recall  F
4   0.3      0.819   0.889   0.886   0.887
4   0.5      0.819   0.889   0.885   0.887
4   1.0      0.819   0.889   0.885   0.887
4   2.0      0.819   0.889   0.884   0.887
4   4.0      0.817   0.889   0.883   0.886
4   8.0      0.814   0.888   0.880   0.884
4   16.0     0.806   0.887   0.874   0.880
5   0.3      0.909   0.945   0.947   0.946
5   0.6      0.910   0.946   0.947   0.946
5   1.3      0.909   0.946   0.945   0.945
5   2.5      0.907   0.945   0.944   0.944
5   5.0      0.903   0.943   0.941   0.942
5   10.0     0.895   0.939   0.935   0.937
5   20.0     0.881   0.931   0.925   0.928
6   0.4      0.944   0.968   0.970   0.969
6   0.8      0.945   0.968   0.969   0.969
6   1.5      0.944   0.968   0.969   0.968
6   3.0      0.944   0.967   0.967   0.967
6   6.0      0.940   0.965   0.965   0.965
6   12.0     0.933   0.961   0.960   0.961
6   24.0     0.921   0.955   0.952   0.954
7   0.4      0.953   0.973   0.974   0.974
7   0.9      0.953   0.973   0.974   0.974
7   1.8      0.953   0.974   0.974   0.974
7   3.5      0.953   0.973   0.974   0.974
7   7.0      0.952   0.972   0.972   0.972
7   14.0     0.946   0.969   0.969   0.969
7   28.0     0.937   0.964   0.963   0.964
8   0.5      0.954   0.975   0.975   0.975
8   1.0      0.955   0.975   0.975   0.975
8   2.0      0.956   0.976   0.976   0.976
8   4.0      0.956   0.975   0.976   0.975
8   8.0      0.955   0.975   0.975   0.975
8   16.0     0.951   0.972   0.972   0.972
8   32.0     0.942   0.968   0.967   0.967
9   0.6      0.954   0.975   0.974   0.974
9   1.1      0.955   0.975   0.975   0.975
9   2.3      0.956   0.976   0.975   0.976
9   4.5      0.958   0.976   0.976   0.976
9   9.0      0.956   0.975   0.975   0.975
9   18.0     0.952   0.973   0.973   0.973
9   36.0     0.944   0.969   0.968   0.968
10  0.6      0.953   0.974   0.973   0.974
10  1.3      0.954   0.975   0.974   0.975
10  2.5      0.956   0.976   0.976   0.976
10  5.0      0.958   0.976   0.976   0.976
10  10.0     0.957   0.976   0.976   0.976
10  20.0     0.952   0.973   0.973   0.973
10  40.0     0.944   0.969   0.968   0.969

The higher the interpolation ratio, the more smoothing is applied to higher-order character n-gram estimates.

The highest achieved F-measure is 0.976, and the highest whole-word accuracy is 0.958. The most important thing to notice here is that the response of the scores to the parameters is relatively smooth and insensitive to small changes. That is, 9-grams perform pretty much like 8-grams, and an interpolation ratio of 4.0 performs pretty much like 8.0.

Our default recommendation for English is to use 8-grams with the default interpolation ratio of 8.0. The runs in the body of the tutorial use 8-grams with an interpolation ratio of 4.0.

Dutch Hyphenation

We run Dutch the same way as English.

The Dutch Corpus

Here's the ant target to create the corpus:

> ant -Dcelex.ow.file=e:\data\celex2\unpacked\celex2\dutch\dow\dow.cd -Dhyphens.file=celex-dutch-hyphens.txt celex-hyphens
# hyphenations=315957
# unique words=315897
[tim er, ti mer]
[lijs ten, lij sten]
[made, ma de]
[mi se, mise]
...
[story, sto ry]
[an ne, anne]
[wets taal, wet staal]
[coun try, country]
[raket, ra ket]
RESIDUAL ESCAPES=[]

Some of those duplicate hyphenations look highly suspect.

Dutch Cross-Validation

Here's the ant target to evaluate using cross-validation:

> ant -Dverbose=false -Dhyphens.file=celex-dutch-hyphens.txt xval-hyphens

Reading data from file=C:\carp\mycvs\lingpipe\demos\tutorial\hyphenation\celex-dutch-hyphens.txt
     #words=315957
     #hyphens=811045
     #hyphen insertion points=3115724
ACCURACY ARGS: NGRAM=8 NUM_CHARS=64 INTERPOLATION_RATIO=4.0 HYPHENATION_N_BEST=1024 VERBOSE=false


NGRAM=  8 INTERP= 4.0 ACC=0.994 P=0.998 R=0.998 F=0.998

OVERALL ACCURACIES ACROSS FOLDS

PER HYPHENATION: PREC=0.998 RECALL=0.998 F(1)=0.998

PER TAGGING DECISION: ACC=0.9990 #DECISIONS=3115724

WHOLE WORD: ACCURACY=0.994 DEV=0.001

First note that there are almost five times as many words in the Dutch dictionary as in the English one. There are over 300,000 unique Dutch hyphenations in the corpus.

It looks like hyphenation in Dutch is a whole lot easier than in English. In fact, it's nearly perfect, with a mistake being made only every 200th word or 500th hyphenation point.

German Hyphenation

German Corpus Special Cases

Just to keep us on our toes, the German data's in a slightly different format, with only a single hyphenation (no ambiguities) listed in column 5 of their data. And it also contains an escape for the Eszett (ß) character.

So we cut and pasted and modified the English extractor to produce a German corpus builder, src/CelexHyphenCorpusGerman.java. It runs just like the other ones, though from its own ant target celex-hyphens-german:

> ant -Dcelex.ow.file=e:\data\celex2\unpacked\celex2\german\gow\gow.cd -Dhyphens.file=celex-german-hyphens.txt celex-hyphens-german
# hyphenations=312215
# unique words=312128
[ras tet, ra stet]
[ra sten, ras ten]
[be stem, bes tem]
...
[in di gnier test, in dig nier test]
[gro tes ke, gro te ske]

RESIDUAL ESCAPES=[]

German Cross-Validation

We don't need to do anything special for evaluating German with cross-validation:

> ant -Dverbose=false -Dhyphens.file=celex-german-hyphens.txt xval-hyphens

Reading data from file=C:\carp\mycvs\lingpipe\demos\tutorial\hyphenation\celex-german-hyphens.txt
     #words=312208
     #hyphens=793598
     #hyphen insertion points=3164400
ACCURACY ARGS: NGRAM=8 NUM_CHARS=64 INTERPOLATION_RATIO=4.0 HYPHENATION_N_BEST=1024 VERBOSE=false

NGRAM=  8 INTERP= 4.0 ACC=0.997 P=0.999 R=0.999 F=0.999

OVERALL ACCURACIES ACROSS FOLDS

PER HYPHENATION: PREC=0.999 RECALL=0.999 F(1)=0.999

PER TAGGING DECISION: ACC=0.9996 #DECISIONS=3164400

WHOLE WORD: ACCURACY=0.997 DEV=0.000

German's even easier to hyphenate than English. We didn't run multiple parameters here, but it's easy to imagine longer contexts might help for German, because of the length of some of the compounds in the dictionary.

English Syllabification

Next, we turn to English syllabification. Syllabification is essentially the same problem as hyphenation, only with a different set of characters.

Extracting the Syllables

To extract a training corpus, we have a separate program src/CelexSyllableCorpus.java. It's basically just like the hyphen corpus extractor, only with a different separator and different set of characters.

English syllables can be extracted using the ant target celex-syllables:

> ant -Dcelex.pw.file=e:\data\celex2\unpacked\celex2\english\epw\epw.cd -Dcelex.sylls.file=celex-english-syllables.txt celex-syllables

# syllabifications=108224
# unique words=108123
[I nO: gjU r+l, In O: gjU r+l]
[nEs lIN, nE slIN]
[In O: gjU rl,, I nO: gjU rl,]
...
[pOp + d+mz, pO p+ d+mz]
[pOn j+dz, pO nj+dz]
[n& tSr+ lIst, n& tSr+l Ist]

There are many more syllabifications than there were hyphenations. This is primarily due to there being multiple pronunciations for words.

The phonetic symbols are encoded in a CELEX-specific encoding. Some of the symbols represent length, as in the character ':', and some of them indicate syllabic consonants, and so on. There's a full description in the annotation guides referenced above. Rather than replacing the multi-character symbols with their IPA equivalents, we just left them as is, which probably hurts accuracy somewhat.

We had to change one of the input symbols for the syllabification task because the CELEX phonetic encoding uses the at-sign (@), which we already use as a boundary symbol in the hyphenation program. We just replaced it with a plus sign ('+'), which doesn't conflict with any other symbol.

Performing Syllabification

Because we output the data in the same format as for hyphenation, we do not need to recode the evaluator. We run the standard forward evaluation from the ant target xval-hyphens. The bidirectional and intersecting targets may be invoked in the same way.

> ant -Dhyphens.file=celex-english-syllables.txt -Dverbose=false xval-hyphens

Reading data from file=C:\carp\mycvs\lingpipe\demos\tutorial\hyphenation\celex-english-syllables.txt
     #words=108224
     #hyphens=218187
     #hyphen insertion points=689124
ACCURACY ARGS: NGRAM=8 NUM_CHARS=64 INTERPOLATION_RATIO=4.0 HYPHENATION_N_BEST=1024 VERBOSE=false

NGRAM=  8 INTERP= 4.0 ACC=0.988 P=0.995 R=0.994 F=0.995

OVERALL ACCURACIES ACROSS FOLDS

PER HYPHENATION: PREC=0.995 RECALL=0.994 F(1)=0.995

PER TAGGING DECISION: ACC=0.9966 #DECISIONS= 689124

WHOLE WORD: ACCURACY=0.988 DEV=0.001

As expected, syllabification is a whole lot more predictable than hyphenation in English. That's because the sound system used by the language is much more regular than the spelling.