The Stanford NLP (Natural Language Processing) Group
Published: 2019-06-27


 

Stanford Word Segmenter

 

 

 


 

 

Tokenization of raw text is a standard pre-processing step for many NLP tasks. For English, tokenization usually involves punctuation splitting and separation of some affixes like possessives. Other languages require more extensive token pre-processing, which is usually called segmentation.

The Stanford Word Segmenter currently supports Arabic and Chinese. The provided segmentation schemes have been found to work well for a variety of applications.

The system requires Java 1.6+ to be installed. We recommend at least 1G of memory for documents that contain long sentences. For files with shorter sentences (e.g., 20 tokens), decrease the memory requirement by changing the option java -mx1g in the run scripts.
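For example, the heap size is simply a flag in the bundled run scripts and can be edited there. A minimal sketch, assuming a script that invokes the Chinese CRF classifier directly (the remaining flags, elided here, vary by release):

    # default heap setting in the run script
    java -mx1g -cp seg.jar edu.stanford.nlp.ie.crf.CRFClassifier ...
    # reduced heap for files containing only short sentences
    java -mx512m -cp seg.jar edu.stanford.nlp.ie.crf.CRFClassifier ...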

Arabic

Arabic is a root-and-template language with abundant bound morphemes. These morphemes include possessives, pronouns, and discourse connectives. Segmenting bound morphemes reduces lexical sparsity and simplifies syntactic analysis.

The Arabic segmenter model processes raw text according to the Penn Arabic Treebank 3 (ATB) standard. It is a stand-alone implementation of the segmenter described in:

Spence Green and John DeNero. 2012. A Class-Based Agreement Model for Generating Accurately Inflected Translations. In ACL.
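A command-line invocation of the Arabic model looks roughly like the following. This is only a sketch: the jar name, main class, and especially the model file name are assumptions and should be checked against the scripts and data directory shipped in the download.

    # hypothetical paths; take the real model file name from the data directory
    java -mx1g -cp stanford-segmenter.jar \
        edu.stanford.nlp.international.arabic.process.ArabicSegmenter \
        -loadClassifier data/arabic-segmenter-model.ser.gz \
        -textFile my_arabic_file.txt > my_arabic_file.txt.segmented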

Chinese

Chinese is standardly written without spaces between words (as are some other languages). This software will split Chinese text into a sequence of words, defined according to some word segmentation standard. It is a Java implementation of the CRF-based Chinese Word Segmenter described in:

Huihsin Tseng, Pichuan Chang, Galen Andrew, Daniel Jurafsky and Christopher Manning. 2005. A Conditional Random Field Word Segmenter for Sighan Bakeoff 2005. In Fourth SIGHAN Workshop on Chinese Language Processing.

Two models with two different segmentation standards are included: Chinese Penn Treebank (CTB) and Peking University (PKU).
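The bundled segment.sh script selects between the two standards by name. A minimal usage sketch, assuming the script's arguments are model, file, encoding, and size of the n-best list (the file name here is a placeholder):

    # segment a UTF-8 file with the CTB-standard model (trailing 0 = plain 1-best output)
    ./segment.sh ctb my_file.utf8 UTF-8 0
    # the same file with the PKU-standard model
    ./segment.sh pku my_file.utf8 UTF-8 0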

On May 21, 2008, we released a version that makes use of lexicon features. With external lexicon features, the segmenter segments more consistently and also achieves higher F measure when we train and test on the bakeoff data. This version is close to the CRF-Lex segmenter described in:

Pi-Chuan Chang, Michel Galley and Chris Manning. 2008. Optimizing Chinese Word Segmentation for Machine Translation Performance. In WMT.

The older version (2006-05-11), which does not use external lexicon features, will remain available for download, but we recommend using the latest version.

Another new feature of the latest release is that the segmenter can now output k-best segmentations.
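Under the same assumption about segment.sh's arguments as above, k-best output would be requested by making the final argument the size of the k-best list:

    # print the 5 best segmentations per sentence instead of only the single best one
    ./segment.sh ctb my_file.utf8 UTF-8 5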

The segmenter is available for download, licensed under the GNU General Public License (v2 or later). Source is included. The package includes components for command-line invocation and a Java API. The segmenter code is dual licensed (in a similar manner to MySQL, etc.). Open source licensing is under the full GPL, which allows many free uses. For distributors of proprietary software, commercial licensing with a ready-to-sign agreement is available. If you don't need a commercial license, but would like to support maintenance of these tools, we welcome gift funding.
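As a sketch of the Java API route, the following loads a Chinese model through the CRFClassifier class and segments one string. The model and dictionary file names are assumptions about what ships in the data directory of the download; adjust the paths to match your unpacked package.

    import java.util.List;
    import java.util.Properties;

    import edu.stanford.nlp.ie.crf.CRFClassifier;
    import edu.stanford.nlp.ling.CoreLabel;

    public class SegmenterSketch {
      public static void main(String[] args) {
        // Directory of the unpacked download; the file names below are assumed
        // from the distribution's data directory and may differ by release.
        String basedir = "data";

        Properties props = new Properties();
        props.setProperty("sighanCorporaDict", basedir);
        props.setProperty("serDictionary", basedir + "/dict-chris6.ser.gz");
        props.setProperty("inputEncoding", "UTF-8");
        props.setProperty("sighanPostProcessing", "true");

        // Load the CTB-standard model (swap in pku.gz for the PKU standard).
        CRFClassifier<CoreLabel> segmenter = new CRFClassifier<CoreLabel>(props);
        segmenter.loadClassifierNoExceptions(basedir + "/ctb.gz", props);

        // Segment one sentence of unsegmented Chinese text into a word list.
        String sample = "我住在斯坦福大学附近。";
        List<String> words = segmenter.segmentString(sample);
        System.out.println(words);
      }
    }

If the call succeeds, the printed list contains the words of the sentence as separate strings.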

The download is a zipped file consisting of model files, compiled code, and source files. If you unpack it, you should have everything needed. Simple scripts are included to invoke the segmenter.

We have 3 mailing lists for the Stanford Word Segmenter, all of which are shared with other JavaNLP tools (with the exclusion of the parser). Each address is at @lists.stanford.edu:

  1. java-nlp-user This is the best list to post to in order
    to ask questions, make announcements, or for discussion among JavaNLP
    users. You have to subscribe to be able to use it. Join the list
    by emailing java-nlp-user-join@lists.stanford.edu. (Leave the
    subject and message body empty.)
  2. java-nlp-announce This list will be used only to announce
    new versions of Stanford JavaNLP tools. So it will be very low volume (expect 1-3
    messages a year). Join the list by emailing
    java-nlp-announce-join@lists.stanford.edu. (Leave the
    subject and message body empty.)
  3. java-nlp-support This list goes only to the software
    maintainers. It's a good address for licensing questions, etc. For
    general use and support questions, please join and use
    java-nlp-user.
    You cannot join java-nlp-support, but you can mail questions to
    java-nlp-support@lists.stanford.edu.

