About the Chinese Word Demo
Chinese is written without spaces between the words. This demo illustrates how to break Chinese into words.
Annotation Standard
Because the notion of a word is not a natural one in Chinese, different "standards" have been developed for determining wordhood. This demo is based on the Academia Sinica corpus distributed as part of SIGHAN 2005.
LingPipe's Word Segmentation
LingPipe's word segmentation is performed using a noisy channel model, where the source is a character language model trained on segmented text and the channel simply allows spaces to be freely inserted.
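The noisy channel idea can be illustrated with a deliberately small sketch. It is not LingPipe's implementation: it swaps in a character bigram model with additive smoothing (the `alpha` value, the class name `CharBigramSegmenter`, and the toy training lines are all assumptions made for illustration). With a bigram source model, each gap between two characters can be decided independently by comparing the probability of the pair with and without an intervening space; a higher-order model would need a dynamic-programming search instead.

```python
import math
from collections import Counter


class CharBigramSegmenter:
    """Toy noisy-channel segmenter: bigram character model over spaced text."""

    BOUNDARY = "\n"  # marks the start and end of each training line

    def __init__(self, segmented_lines, alpha=0.1):
        self.alpha = alpha  # additive smoothing constant (an assumption)
        self.bigrams = Counter()
        self.contexts = Counter()
        symbols = set()
        for line in segmented_lines:
            # Normalize whitespace so words are separated by single spaces.
            text = " ".join(line.split())
            padded = self.BOUNDARY + text + self.BOUNDARY
            symbols.update(padded)
            for a, b in zip(padded, padded[1:]):
                self.bigrams[(a, b)] += 1
                self.contexts[a] += 1
        # Reserve one extra slot for characters never seen in training.
        self.vocab_size = len(symbols) + 1

    def logp(self, a, b):
        """Smoothed log probability of character b following character a."""
        num = self.bigrams[(a, b)] + self.alpha
        den = self.contexts[a] + self.alpha * self.vocab_size
        return math.log(num / den)

    def segment(self, text):
        """Insert a space wherever the source model prefers one."""
        if not text:
            return []
        out = [text[0]]
        for a, b in zip(text, text[1:]):
            joined = self.logp(a, b)
            split = self.logp(a, " ") + self.logp(" ", b)
            if split > joined:
                out.append(" ")
            out.append(b)
        return "".join(out).split()


# Toy training data: one segmented sentence per line, words separated by spaces.
training = ["我 喜欢 北京", "我 爱 北京", "他 喜欢 上海"]
segmenter = CharBigramSegmenter(training)
words = segmenter.segment("我喜欢上海")
```

The channel here contributes nothing to the score because spaces may be inserted freely, so the decision rests entirely on the source model.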
Model Quality
The demo uses a 4-gram character language model for the source of the noisy channel. This model is roughly 30MB in compiled form. A 5-gram model performs significantly better (about 25% fewer errors in the SIGHAN evaluation), but its compiled form is 80MB.
Sentences
The training data provided a single sentence per line, so this segmenter assumes that sentences are already marked. LingPipe does not yet have a Chinese sentence detector, so this is left to the user. Segmentation quality should remain acceptable even when sentences are not marked.
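Since sentence detection is left to the user, one simple option is to break the input after sentence-final punctuation before passing it to the segmenter. The sketch below is a naive approach, not part of LingPipe; the function name `split_sentences` and the choice of punctuation marks (。！？) are assumptions, and it does not handle closing quotes or brackets that follow the punctuation.

```python
import re

# Runs of sentence-final punctuation; the set of marks is an assumption.
_SENTENCE_END = re.compile(r"([。！？]+)")


def split_sentences(text):
    """Return a list of sentences, each keeping its final punctuation."""
    # With a capturing group, re.split yields
    # [body, punctuation, body, punctuation, ..., tail].
    parts = _SENTENCE_END.split(text)
    sentences = []
    for i in range(0, len(parts) - 1, 2):
        sentence = (parts[i] + parts[i + 1]).strip()
        if sentence:
            sentences.append(sentence)
    # Any text after the last punctuation mark is kept as its own sentence.
    tail = parts[-1].strip()
    if tail:
        sentences.append(tail)
    return sentences


# One sentence per line, matching the layout of the training data.
one_per_line = "\n".join(split_sentences("我喜欢北京。你呢？好"))
```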
Chinese Word XML Markup
This demo only provides first-best output, with words wrapped in token elements.
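Reading the words back out of the markup could look like the following sketch. The exact output schema is not shown here, so the element name `token`, the `output` root element, and the sample string are all assumptions for illustration; adjust the tag name to match what the demo actually emits.

```python
import xml.etree.ElementTree as ET


def extract_tokens(xml_text, tag="token"):
    """Collect the text of every element with the given tag, in document order."""
    root = ET.fromstring(xml_text)
    return [elt.text or "" for elt in root.iter(tag)]


# Hypothetical demo output; the real root element and attributes may differ.
sample = "<output><token>我</token><token>喜欢</token><token>北京</token></output>"
tokens = extract_tokens(sample)
```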
Chinese Word Demo on the Web
The demo is hosted on the web at the following URL:
Academia Sinica Corpus
http://lingpipe-demos.com:8080/lingpipe-demos/word_zh_as/textInput.html
For detailed information about using web demos, including web form, file upload, and web service instructions, see the web demo instructions.
Chinese Word Demo via GUI
To launch the demo in a GUI, first change directories to the command directory and then invoke the demo batch script. Note: Parameters are set in the GUI, not as arguments to the launch script.
Windows Operating System
Academia Sinica Corpus
> cd %LINGPIPE_HOME%\demos\generic\bin
> gui_word_zh_as.bat
Unix-like Operating Systems
Academia Sinica Corpus
> cd $LINGPIPE_HOME/demos/generic/bin
> sh gui_word_zh_as.sh
For detailed information about running demos in a GUI, see the GUI demo instructions.
Chinese Word Demo via Shell Command
Shell commands may be run over single files, all of the files in a directory, or using standard input/output.
Running over a Directory
Academia Sinica Corpus
> cd $LINGPIPE/demos/generic/bin
> cmd_word_zh_as.bat -inDir=../../data/testdir -outDir=/testout
Running a Single File
Academia Sinica Corpus
> cd $LINGPIPE/demos/generic/bin
> cmd_word_zh_as.bat -inFile=../../data/testdir/foo.txt -outFile=foo.out.xml
The other genres are handled the same way, with different suffixes in place of general_brown.
Running through a Pipe (Standard input/output)
Academia Sinica Corpus
> cd demos/generic/bin
> echo See Spot. See Spot run. | cmd_word_zh_as.bat
Running in Unix-like Operating Systems
For unix-like operating systems such as Unix, Solaris, Linux, or Macintosh OS X:
- Replace path backward slashes (\) with forward slashes (/), and
- substitute .sh for the .bat suffix in the command.
For detailed information about running demos from the command line, see the command line demo instructions.
Chinese Word Demo Scripts
The following scripts are available in $LINGPIPE/demos/generic/bin for running the demo. Note that the scripts come in four flavors, distinguishing command line from GUI, and the Windows DOS shell from the Unix shell.
| Genre | Corpus | Mode | Windows DOS | Unix/Linux/Mac sh |
|---|---|---|---|---|
| General | Academia Sinica | Command | cmd_word_zh_as.bat | cmd_word_zh_as.sh |
| General | Academia Sinica | GUI | gui_word_zh_as.bat | gui_word_zh_as.sh |
Chinese Word Demo Parameters
The following is a complete list of parameters for the demo.
General Demo Parameters
These parameters apply to every version (web/GUI/command) of every demo.
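| Parameter | Description | Usage Constraints |
|---|---|---|
| inCharset | Input character set | Optional. Defaults to platform default. |
| outCharset | Output character set | Optional. Defaults to platform default. |
| contentType | Input content type | May be one of: text/plain. |
| removeElts | Element tags to remove | Optional. May only be used with contentType=text/html or contentType=text/xml. Each value may be a comma-separated list. If neither removeElts nor includeElts is specified, all text content is processed. |
| includeElts | Elements to annotate | Optional. May only be used with contentType=text/html or contentType=text/xml. Each value may be a comma-separated list. If neither removeElts nor includeElts is specified, all text content is processed. |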
Command-Line Only Parameters
These parameters apply to every command-line demo, but are not relevant for the GUI or web versions of the demos.
| Parameter | Description | Usage Constraints |
|---|---|---|
| inFile | Readable input file | May not be used with inDir. If either is not specified, defaults to standard input or output. |
| outFile | Writeable output file | May not be used with inDir. If either is not specified, defaults to standard input or output. |
| inDir | Readable input directory | May not be used with inFile or outFile. If used, inDir and outDir must both be specified. |
| outDir | Writeable output directory | May not be used with inFile or outFile. If used, inDir and outDir must both be specified. |
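When the command-line scripts are driven from another program, the constraints in the table can be checked before the script is launched. The following is a minimal sketch under one reading of those constraints (directory options are never mixed with file options, and inDir and outDir always appear together). The helper `build_args` is hypothetical and not part of LingPipe; the `-name=value` argument form follows the command examples in this document.

```python
def build_args(in_file=None, out_file=None, in_dir=None, out_dir=None,
               in_charset=None, out_charset=None):
    """Return a list of -name=value arguments, or raise ValueError."""
    uses_dir = in_dir is not None or out_dir is not None
    uses_file = in_file is not None or out_file is not None
    if uses_dir and uses_file:
        raise ValueError("inDir/outDir may not be used with inFile/outFile")
    if uses_dir and (in_dir is None or out_dir is None):
        raise ValueError("inDir and outDir must both be specified")
    options = [
        ("inFile", in_file),
        ("outFile", out_file),
        ("inDir", in_dir),
        ("outDir", out_dir),
        ("inCharset", in_charset),
        ("outCharset", out_charset),
    ]
    # Omitted options fall back to the defaults described in the tables:
    # standard input/output and the platform default character set.
    return [f"-{name}={value}" for name, value in options if value is not None]


# Arguments for the single-file example shown earlier in this document.
single_file_args = build_args(in_file="../../data/testdir/foo.txt",
                              out_file="foo.out.xml")
```

The resulting list can be appended to the script name and passed to a process launcher such as Python's `subprocess.run`.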