NLP Study: Downloading and Using CoreNLP

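This part covers how to get started with CoreNLP and the different ways of using it. You can run CoreNLP from the command line, from Java code, or through calls to a server. CoreNLP also supports several languages, including Arabic, Chinese, English, French, German, and Spanish.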

Getting a copy

You can download Stanford CoreNLP from the link below.

https://stanfordnlp.github.io/CoreNLP/download.html

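This downloads a large (482 MB) zip file containing:

the CoreNLP code jar
the CoreNLP models jar (needed on your classpath for most tasks)
the libraries required to run CoreNLP
the project's documentation and source code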

That is everything you need to start using CoreNLP on English. Unzip the file, open the resulting folder, and you are ready to go.

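Other languages: to use Stanford CoreNLP on other (human) languages, you need additional model files. Model files are provided for several languages, along with extra English models, including ones for English with nonstandard capitalization (for example, text messages or telegram-style text that is not capitalized conventionally). The table below lists the latest models; note that the links as given point to the 4.4.0 jars on Maven Central while the version column reads 4.5.4, so check that the jar you download matches your CoreNLP release. Models for earlier versions are available on the release history page.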

Language	Model Jar	Version
Arabic	https://search.maven.org/remotecontent?filepath=edu/stanford/nlp/stanford-corenlp/4.4.0/stanford-corenlp-4.4.0-models-arabic.jar	4.5.4
Chinese	https://search.maven.org/remotecontent?filepath=edu/stanford/nlp/stanford-corenlp/4.4.0/stanford-corenlp-4.4.0-models-chinese.jar	4.5.4
English (extra)	https://search.maven.org/remotecontent?filepath=edu/stanford/nlp/stanford-corenlp/4.4.0/stanford-corenlp-4.4.0-models-english.jar	4.5.4
English (KBP)	https://search.maven.org/remotecontent?filepath=edu/stanford/nlp/stanford-corenlp/4.4.0/stanford-corenlp-4.4.0-models-english-kbp.jar	4.5.4
French	https://search.maven.org/remotecontent?filepath=edu/stanford/nlp/stanford-corenlp/4.4.0/stanford-corenlp-4.4.0-models-french.jar	4.5.4
German	https://search.maven.org/remotecontent?filepath=edu/stanford/nlp/stanford-corenlp/4.4.0/stanford-corenlp-4.4.0-models-german.jar	4.5.4
Hungarian	https://search.maven.org/remotecontent?filepath=edu/stanford/nlp/stanford-corenlp/4.4.0/stanford-corenlp-4.4.0-models-hungarian.jar	4.5.4
Italian	https://search.maven.org/remotecontent?filepath=edu/stanford/nlp/stanford-corenlp/4.4.0/stanford-corenlp-4.4.0-models-italian.jar	4.5.4
Spanish	https://search.maven.org/remotecontent?filepath=edu/stanford/nlp/stanford-corenlp/4.4.0/stanford-corenlp-4.4.0-models-spanish.jar	4.5.4

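If you want to modify the source code and recompile it, see the project's build instructions. Earlier releases are available on the release history page.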

GitHub: https://github.com/stanfordnlp/CoreNLP

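Maven: Stanford CoreNLP is available on Maven Central. The key point is that CoreNLP needs its models to run (for almost everything beyond the tokenizer and sentence splitter), so your pom.xml must declare both the code jar and the models jar, as shown below. (Note: Maven releases usually appear a few days after the release on the website.)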

<dependencies>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>4.4.0</version>
</dependency>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>4.4.0</version>
    <classifier>models</classifier>
</dependency>
</dependencies>

To get the Arabic, Chinese, German, or Spanish model jars from Maven as well, add a dependency like the following to your pom.xml:

<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>4.4.0</version>
    <classifier>models-chinese</classifier>
</dependency>

Replace "models-chinese" with "models-english", "models-english-kbp", "models-arabic", "models-french", "models-german", or "models-spanish" to get the resources for the other languages.
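
The classifier names above also appear in the download links of the model table, so a direct download URL can be assembled from a version and a classifier. The following is a minimal sketch; the function name corenlp_model_url is hypothetical, and the URL pattern is simply the one visible in the table, so it is only an assumption that every version and classifier combination exists on Maven Central.

```shell
# Hypothetical helper: build a Maven Central download URL for a CoreNLP jar.
# $1: CoreNLP version (for example 4.4.0)
# $2: classifier (for example models-chinese)
corenlp_model_url() {
  base="https://search.maven.org/remotecontent?filepath=edu/stanford/nlp/stanford-corenlp"
  # The version appears twice: once as a directory, once in the file name.
  printf '%s/%s/stanford-corenlp-%s-%s.jar\n' "$base" "$1" "$1" "$2"
}
```

It could be used as wget "$(corenlp_model_url 4.4.0 models-chinese)"; the quotes matter because the URL contains a question mark.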

A sample Maven project that uses Stanford CoreNLP is included in the GitHub release, under the examples/sample-maven-project directory.

You can build the project with:

mvn compile

You can then run a demo program like this:

export MAVEN_OPTS="-Xmx14000m"
mvn exec:java -Dexec.mainClass="edu.stanford.nlp.StanfordCoreNLPEnglishTestApp"

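Steps to set up from the official release

This example explains how to set up CoreNLP from the latest official release. It walks through downloading the package and running a simple CoreNLP command-line invocation.

Prerequisites:

Java 8. Check with java -version; the output should include a line like: java version "1.8.0_92"
A zip tool
To follow the steps below exactly: bash or a similar shell, and wget or a similar downloader.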
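
The prerequisites above can be checked from the shell before starting. This is a minimal sketch, not part of the official instructions; the helper names missing_tools and java_major are hypothetical.

```shell
# Print each named tool that is not found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Extract the major Java version from a version string.
# Handles the legacy "1.8.0_92" scheme (major is 8) and the
# newer "17.0.2" scheme (major is 17).
java_major() {
  case "$1" in
    1.*) set -- "${1#1.}"; printf '%s\n' "${1%%.*}" ;;
    *)   printf '%s\n' "${1%%[-.+]*}" ;;
  esac
}
```

For example, missing_tools java unzip wget prints nothing when all three are installed. The version string can be taken from the first line of java -version (which writes to standard error), for instance with java -version 2>&1 | sed -n '1s/.*"\(.*\)".*/\1/p'.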

Steps:

Download the CoreNLP zip file from
http://stanfordnlp.github.io/CoreNLP/index.html#download:

wget http://nlp.stanford.edu/software/stanford-corenlp-latest.zip

Or with curl (available by default on macOS):

curl -O -L http://nlp.stanford.edu/software/stanford-corenlp-latest.zip

Unzip the release:

unzip stanford-corenlp-latest.zip

Enter the unpacked folder (its name includes the release version, 4.5.4 here):

cd stanford-corenlp-4.5.4
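
Because the "latest" zip unpacks into a directory named after the release, the exact name changes over time. The sketch below locates it without hard-coding the version; corenlp_dir is a hypothetical helper, and it simply returns the first stanford-corenlp-* directory it finds, which is an assumption that only one release is unpacked there.

```shell
# Hypothetical helper: print the first unpacked CoreNLP directory.
# $1: parent directory to search (defaults to the current directory)
corenlp_dir() (
  # The trailing slash makes the glob match directories only,
  # so stanford-corenlp-latest.zip is skipped.
  for d in "${1:-.}"/stanford-corenlp-*/; do
    [ -d "$d" ] || continue
    printf '%s\n' "${d%/}"
    exit 0
  done
  exit 1
)
```

With this defined, cd "$(corenlp_dir)" enters the unpacked folder whatever its version suffix is.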

Set up your classpath. If you are using an IDE, configure the classpath there. If you are using bash or a similar shell, the following works.

# Append the absolute path of every jar under the current directory to CLASSPATH
for file in $(find . -name "*.jar"); do export CLASSPATH="$CLASSPATH:$(realpath "$file")"; done

If you use CoreNLP often, it is convenient to put the following line in your ~/.bashrc (or equivalent), replacing /path/to/corenlp/ with the directory where you unpacked CoreNLP:

# Append the absolute path of every jar under the CoreNLP directory to CLASSPATH
for file in $(find /path/to/corenlp/ -name "*.jar"); do export CLASSPATH="$CLASSPATH:$(realpath "$file")"; done
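
The loop above splits the output of find on whitespace, so it breaks if the CoreNLP path contains spaces, and it leaves a leading colon when CLASSPATH starts out empty. A possible alternative, offered only as a sketch with a hypothetical function name build_classpath, reads the jar list line by line:

```shell
# Hypothetical helper: print a colon-separated list of all jars under a directory.
# $1: directory to search
# Runs in a subshell so its variables do not leak into the caller.
# Paths containing newlines are not supported.
build_classpath() (
  # Resolve to an absolute path so find prints absolute jar paths.
  dir=$(cd "$1" && pwd) || exit 1
  cp=""
  while IFS= read -r jar; do
    # Join with ":" but avoid a leading separator.
    [ -n "$jar" ] && cp="${cp:+$cp:}$jar"
  done <<EOF
$(find "$dir" -name '*.jar' | sort)
EOF
  printf '%s\n' "$cp"
)
```

It could be used as export CLASSPATH="$(build_classpath /path/to/corenlp)", which replaces the existing CLASSPATH rather than appending to it.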

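Try it out! For example, the following creates a simple text file to annotate and runs CoreNLP on it. The output is saved as a JSON file named after the input (input.txt.json in recent releases; the original instructions name it input.txt.out). Note that running all CoreNLP annotators in the default annotation pipeline requires quite a lot of memory. In most cases you should give it at least 3 GB (-mx3g).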

echo "the quick brown fox jumped over the lazy dog" > input.txt
java -mx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat json -file input.txt
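
When memory is tight, one option is to raise or lower the heap size and to restrict the pipeline with the -annotators option. The wrapper below is a sketch under stated assumptions: run_corenlp and the DRY_RUN variable are hypothetical names, -Xmx is used as the standard spelling of the heap flag, and the annotator list is only an example.

```shell
# Hypothetical wrapper around the command shown above.
# $1: input file
# $2: heap size (defaults to 3g)
# $3: optional comma-separated annotator list, e.g. tokenize,ssplit,pos
# With DRY_RUN set, the command is printed instead of executed.
run_corenlp() (
  file=$1
  mem=${2:-3g}
  annotators=${3:-}
  set -- java "-Xmx$mem" edu.stanford.nlp.pipeline.StanfordCoreNLP \
    -outputFormat json -file "$file"
  if [ -n "$annotators" ]; then
    set -- "$@" -annotators "$annotators"
  fi
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
)
```

For instance, DRY_RUN=1 run_corenlp input.txt 2g tokenize,ssplit,pos shows the command that would be run without starting Java.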

Steps to set up from the GitHub HEAD version

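Prerequisites:

Java 8. Check with java -version; the output should include a line like: java version "1.8.0_92"
Apache Ant
A zip tool
To follow the steps below exactly: bash or a similar shell, and wget or a similar downloader.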

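Steps:

Clone the CoreNLP Git repository (the SSH URL below requires a GitHub SSH key; https://github.com/stanfordnlp/CoreNLP.git works without one):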

git clone git@github.com:stanfordnlp/CoreNLP.git

Enter the CoreNLP directory:

cd CoreNLP

Build the project into a self-contained jar file. The easiest way is:

ant jar

Download the latest models:

wget http://nlp.stanford.edu/software/stanford-corenlp-models-current.jar

Or with curl (available by default on macOS):

curl -O -L http://nlp.stanford.edu/software/stanford-corenlp-models-current.jar

Set up your classpath. If you are using an IDE, configure the classpath there. If you are using bash or a similar shell, the following works.

export CLASSPATH="$CLASSPATH:javanlp-core.jar:stanford-corenlp-models-current.jar";
for file in `find lib -name "*.jar"`; do export CLASSPATH="$CLASSPATH:`realpath $file`"; done

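If you use CoreNLP often, the equivalent lines for your ~/.bashrc (replace /path/to/corenlp/ with the location of your checkout) are:

export CLASSPATH="$CLASSPATH:/path/to/corenlp/javanlp-core.jar:/path/to/corenlp/stanford-corenlp-models-current.jar";
for file in `find /path/to/corenlp/lib -name "*.jar"`; do export CLASSPATH="$CLASSPATH:`realpath $file`"; done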

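Try it out by creating a simple text file and running CoreNLP on it, as in the official-release setup:

echo "the quick brown fox jumped over the lazy dog" > input.txt
java -mx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat json -file input.txt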