Stanford Word Segmenter
Tokenization of raw text is a standard pre-processing step for many NLP tasks. For English, tokenization usually involves punctuation splitting and separation of some affixes like possessives. Other languages require more extensive token pre-processing, which is usually called segmentation.
The Stanford Word Segmenter currently supports Arabic and Chinese. The provided segmentation schemes have been found to work well for a variety of applications.
The system requires Java 1.6+ to be installed. We recommend at least 1G of memory for documents that contain long sentences. For files with shorter sentences (e.g., 20 tokens), decrease the memory requirement by changing the option java -mx1g in the run scripts.
Arabic
Arabic is a root-and-template language with abundant bound morphemes. These morphemes include possessives, pronouns, and discourse connectives. Segmenting bound morphemes reduces lexical sparsity and simplifies syntactic analysis.
The Arabic segmenter model processes raw text according to the Penn Arabic Treebank 3 (ATB) standard. It is a stand-alone implementation of the segmenter described in:
Spence Green and John DeNero. 2012. . In ACL.
Chinese
Chinese is standardly written without spaces between words (as are some other languages). This software will split Chinese text into a sequence of words, defined according to some word segmentation standard. It is a Java implementation of the CRF-based Chinese Word Segmenter described in:
Huihsin Tseng, Pichuan Chang, Galen Andrew, Daniel Jurafsky and Christopher Manning. 2005. . In Fourth SIGHAN Workshop on Chinese Language Processing.
Two models with two different segmentation standards are included: and.
On May 21, 2008, we released a version that makes use of lexicon features. With external lexicon features, the segmenter segments more consistently and also achieves a higher F measure when we train and test on the bakeoff data. This version is close to the CRF-Lex segmenter described in:
Pi-Chuan Chang, Michel Galley and Chris Manning. 2008. . In WMT.
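For readers unfamiliar with the F measure mentioned above, the following is a minimal sketch of how it is commonly computed for word segmentation: each word is treated as a character span, and a predicted word counts as correct only if the same span appears in the reference segmentation. The class and method names are hypothetical, the English example string is for illustration only, and the official bakeoff scoring script may differ in detail.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SegmentationFMeasure {

    // Convert a word sequence into a set of "start-end" character spans.
    static Set<String> spans(List<String> words) {
        Set<String> result = new HashSet<>();
        int start = 0;
        for (String w : words) {
            int end = start + w.length();
            result.add(start + "-" + end);
            start = end;
        }
        return result;
    }

    // F measure = harmonic mean of word-level precision and recall.
    static double fMeasure(List<String> gold, List<String> predicted) {
        Set<String> goldSpans = spans(gold);
        Set<String> predSpans = spans(predicted);
        Set<String> correct = new HashSet<>(predSpans);
        correct.retainAll(goldSpans); // spans present in both segmentations
        if (correct.isEmpty()) {
            return 0.0;
        }
        double precision = (double) correct.size() / predSpans.size();
        double recall = (double) correct.size() / goldSpans.size();
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Two segmentations of the same unspaced string "thetabledownthere".
        List<String> gold = List.of("the", "table", "down", "there");
        List<String> predicted = List.of("the", "tabled", "own", "there");
        System.out.println("F = " + fMeasure(gold, predicted));
    }
}
```

In this example two of the four predicted words match reference spans, so precision and recall are both 0.5 and the F measure is 0.5.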
The older version (2006-05-11), which does not use external lexicon features, is still available for download, but we recommend using the latest version.
Another new feature of the latest release is that the segmenter can now output k-best segmentations. is now also available.
The segmenter is available for download, licensed under the GNU General Public License (v2 or later). Source is included. The package includes components for command-line invocation and a Java API. The segmenter code is dual licensed (in a similar manner to MySQL, etc.). Open source licensing is under the full GPL, which allows many free uses. For distributors of, commercial licensing with a is available. If you don't need a commercial license, but would like to support maintenance of these tools, we welcome gift funding.
The download is a zipped file consisting of model files, compiled code, and source files. If you unpack the archive, you should have everything needed. Simple scripts are included to invoke the segmenter.
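The run scripts pass a heap limit such as -mx1g to the JVM (on common JVMs, -mx is an older spelling of -Xmx). If you edit that value, a small stand-alone program like the sketch below can confirm the limit the JVM actually applies. It is not part of the segmenter distribution; the class name HeapCheck is hypothetical and it uses only the standard Runtime API.

```java
public class HeapCheck {

    // Maximum heap the JVM will attempt to use, in megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: "
                + maxHeapMegabytes() + " MB");
    }
}
```

Running it as `java -mx1g HeapCheck` and then with a smaller value such as `java -mx512m HeapCheck` shows the effect of the option. The reported figure can be somewhat below the requested size, depending on the JVM and garbage collector.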
We have 3 mailing lists for the Stanford Word Segmenter, all of which are shared with other JavaNLP tools (with the exclusion of the parser). Each address is at @lists.stanford.edu:
java-nlp-user
This is the best list to post to in order to ask questions, make announcements, or for discussion among JavaNLP users. You have to subscribe to be able to use it. Join the list via or by emailing java-nlp-user-join@lists.stanford.edu. (Leave the subject and message body empty.) You can also.
java-nlp-announce
This list will be used only to announce new versions of Stanford JavaNLP tools, so it will be very low volume (expect 1-3 messages a year). Join the list via or by emailing java-nlp-announce-join@lists.stanford.edu. (Leave the subject and message body empty.)
java-nlp-support
This list goes only to the software maintainers. It's a good address for licensing questions, etc. For general use and support questions, please join and use java-nlp-user. You cannot join java-nlp-support, but you can mail questions to java-nlp-support@lists.stanford.edu.