如何使用斯坦福解析器将文本拆分为句子?
artificial-intelligence
java
nlp
stanford-nlp
5
0

如何使用Stanford解析器将文本或段落拆分为句子?

是否有任何方法可以提取句子,例如为Ruby提供的getSentencesFromString()

参考资料:
Stack Overflow
收藏
评论
共 5 个回答
高赞 时间 活跃

您可以检查DocumentPreprocessor类。以下是一个简短的摘要。我认为可能还有其他方式可以做您想要的事情。

String paragraph = "My 1st sentence. “Does it work for questions?” My third sentence.";
Reader reader = new StringReader(paragraph);
DocumentPreprocessor dp = new DocumentPreprocessor(reader);
List<String> sentenceList = new ArrayList<String>();

for (List<HasWord> sentence : dp) {
   // SentenceUtils not Sentence
   String sentenceString = SentenceUtils.listToString(sentence);
   sentenceList.add(sentenceString);
}

for (String sentence : sentenceList) {
   System.out.println(sentence);
}
收藏
评论

使用Stanford CoreNLP版本3.6.0或3.7.0提供的简单API

这是3.6.0的示例。它与3.7.0完全相同。

Java代码段

import java.util.List;

import edu.stanford.nlp.simple.Document;
import edu.stanford.nlp.simple.Sentence;
public class TestSplitSentences {
    public static void main(String[] args) {
        Document doc = new Document("The text paragraph. Another sentence. Yet another sentence.");
        List<Sentence> sentences = doc.sentences();
        sentences.stream().forEach(System.out::println);
    }
}

产量:

文字段落。

另一句话。

又一句话。

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>stanfordcorenlp</groupId>
    <artifactId>stanfordcorenlp</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>

    <dependencies>
        <!-- https://mvnrepository.com/artifact/edu.stanford.nlp/stanford-corenlp -->
        <dependency>
            <groupId>edu.stanford.nlp</groupId>
            <artifactId>stanford-corenlp</artifactId>
            <version>3.6.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/com.google.protobuf/protobuf-java -->
        <dependency>
            <groupId>com.google.protobuf</groupId>
            <artifactId>protobuf-java</artifactId>
            <version>2.6.1</version>
        </dependency>
    </dependencies>
</project>
收藏
评论

我知道已经有一个可以接受的答案了……但是通常您只是从带注释的文档中获取SentenceAnnotations。

// creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution 
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// read some text in the text variable
String text = ... // Add your text here!

// create an empty Annotation just with the given text
Annotation document = new Annotation(text);

// run all Annotators on this text
pipeline.annotate(document);

// these are all the sentences in this document
// a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
List<CoreMap> sentences = document.get(SentencesAnnotation.class);

for(CoreMap sentence: sentences) {
  // traversing the words in the current sentence
  // a CoreLabel is a CoreMap with additional token-specific methods
  for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
    // this is the text of the token
    String word = token.get(TextAnnotation.class);
    // this is the POS tag of the token
    String pos = token.get(PartOfSpeechAnnotation.class);
    // this is the NER label of the token
    String ne = token.get(NamedEntityTagAnnotation.class);       
  }

}

来源-http://nlp.stanford.edu/software/corenlp.shtml (下半部分)

而且,如果您仅查找句子,则可以从管道初始化中删除诸如“ parse”和“ dcoref”之类的后续步骤,这将节省一些加载和处理时间。摇滚乐。 〜K

收藏
评论

接受的答案有几个问题。首先,分词器将某些字符(例如字符“”转换为两个字符“”)。其次,将带标记的文本与空格重新连接在一起不会返回与以前相同的结果。因此,来自已接受答案的示例文本将以非平凡的方式转换输入文本。

但是,令牌化程序使用的CoreLabel类会跟踪它们映射到的源字符,因此,如果有原始字符串,则重建合适的字符串很简单。

下面的方法1显示了可接受的答案方法,方法2显示了我的方法,它克服了这些问题。

String paragraph = "My 1st sentence. “Does it work for questions?” My third sentence.";

List<String> sentenceList;

/* ** APPROACH 1 (BAD!) ** */
Reader reader = new StringReader(paragraph);
DocumentPreprocessor dp = new DocumentPreprocessor(reader);
sentenceList = new ArrayList<String>();
for (List<HasWord> sentence : dp) {
    sentenceList.add(Sentence.listToString(sentence));
}
System.out.println(StringUtils.join(sentenceList, " _ "));

/* ** APPROACH 2 ** */
//// Tokenize
List<CoreLabel> tokens = new ArrayList<CoreLabel>();
PTBTokenizer<CoreLabel> tokenizer = new PTBTokenizer<CoreLabel>(new StringReader(paragraph), new CoreLabelTokenFactory(), "");
while (tokenizer.hasNext()) {
    tokens.add(tokenizer.next());
}
//// Split sentences from tokens
List<List<CoreLabel>> sentences = new WordToSentenceProcessor<CoreLabel>().process(tokens);
//// Join back together
int end;
int start = 0;
sentenceList = new ArrayList<String>();
for (List<CoreLabel> sentence: sentences) {
    end = sentence.get(sentence.size()-1).endPosition();
    sentenceList.add(paragraph.substring(start, end).trim());
    start = end;
}
System.out.println(StringUtils.join(sentenceList, " _ "));

输出:

My 1st sentence . _ `` Does it work for questions ? '' _ My third sentence .
My 1st sentence. _ “Does it work for questions?” _ My third sentence.
收藏
评论

使用.net C#软件包:这将拆分句子,使括号正确,并保留原始空格和标点符号:

public class NlpDemo
{
    public static readonly TokenizerFactory TokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(),
                "normalizeParentheses=false,normalizeOtherBrackets=false,invertible=true");

    public void ParseFile(string fileName)
    {
        using (var stream = File.OpenRead(fileName))
        {
            SplitSentences(stream);
        }
    }

    public void SplitSentences(Stream stream)
    {            
        var preProcessor = new DocumentPreprocessor(new UTF8Reader(new InputStreamWrapper(stream)));
        preProcessor.setTokenizerFactory(TokenizerFactory);

        foreach (java.util.List sentence in preProcessor)
        {
            ProcessSentence(sentence);
        }            
    }

    // print the sentence with original spaces and punctuation.
    public void ProcessSentence(java.util.List sentence)
    {
        System.Console.WriteLine(edu.stanford.nlp.util.StringUtils.joinWithOriginalWhiteSpace(sentence));
    }
}

输入: -这句话的字符具有一定的魅力,通常在标点符号和散文中都可以找到。这是第二句话?确实是。

输出: 3个句子(“?”被认为是句尾定界符)

注意:对于这样的句子:“在所有方面,哈维沙姆夫人的课无懈可击(据人们所知!)。”令牌生成器将正确识别出Mrs.末尾的句点不是EOS,但是它将错误地标记!。在括号内作为EOS并“在所有方面”分开。作为第二句话。

收藏
评论
新手导航
  • 社区规范
  • 提出问题
  • 进行投票
  • 个人资料
  • 优化问题
  • 回答问题

关于我们

常见问题

内容许可

联系我们

@2020 AskGo
京ICP备20001863号