Lemmatization in Java [closed]
java
nlp

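I am looking for a lemmatisation implementation for English in Java. I have already found a few, but I need something that does not need too much memory to run (1 GB at most). I do not need a stemmer. Thanks.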

Source: Stack Overflow
2 answers

The Stanford CoreNLP Java library contains a lemmatizer. It is somewhat resource-intensive, but I have run it on my laptop with less than 512 MB of RAM.

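To use it:

  1. Download the jar files;
  2. Create a new project in your editor of choice, or write an Ant script, that includes all of the jar files contained in the archive you just downloaded;
  3. Create a new Java class as shown below (based on the snippet from Stanford's site);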
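import java.util.LinkedList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;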

public class StanfordLemmatizer {

    protected StanfordCoreNLP pipeline;

    public StanfordLemmatizer() {
        // Create StanfordCoreNLP object properties, with POS tagging
        // (required for lemmatization), and lemmatization
        Properties props;
        props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma");

        // StanfordCoreNLP loads a lot of models, so you probably
        // only want to do this once per execution
        this.pipeline = new StanfordCoreNLP(props);
    }

    public List<String> lemmatize(String documentText)
    {
        List<String> lemmas = new LinkedList<String>();

        // create an empty Annotation just with the given text
        Annotation document = new Annotation(documentText);

        // run all Annotators on this text
        this.pipeline.annotate(document);

        // Iterate over all of the sentences found
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for(CoreMap sentence: sentences) {
            // Iterate over all tokens in a sentence
            for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
                // Retrieve and add the lemma for each word into the list of lemmas
                lemmas.add(token.get(LemmaAnnotation.class));
            }
        }

        return lemmas;
    }
}
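
To check a memory figure like the one above on your own machine, one option is to cap the heap with the standard JVM flag -Xmx (for example -Xmx512m) and compare heap usage before and after the pipeline is built. The sketch below uses only the JDK. HeapProbe and its method names are hypothetical helpers, not part of CoreNLP, and the numbers are only approximate because the garbage collector can run at any time.

```java
import java.util.function.Supplier;

// Hypothetical helper (not part of CoreNLP): rough heap measurements
// based on java.lang.Runtime.
class HeapProbe {

    private static final long MB = 1024L * 1024L;

    // Maximum heap the JVM will attempt to use, in MB (controlled by -Xmx).
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / MB;
    }

    // Heap currently in use, in MB (allocated heap minus free heap).
    static long usedHeapMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MB;
    }

    // Runs the builder, prints the approximate change in used heap,
    // and returns the built object so the caller keeps it reachable.
    static <T> T buildAndReport(String label, Supplier<T> builder) {
        System.gc(); // only a hint; the JVM is free to ignore it
        long before = usedHeapMb();
        T built = builder.get();
        long after = usedHeapMb();
        System.out.println(label + ": ~" + (after - before)
                + " MB (max heap " + maxHeapMb() + " MB)");
        return built;
    }

    public static void main(String[] args) {
        // Stand-in allocation. With the CoreNLP jars on the classpath, the
        // builder could be () -> new StanfordLemmatizer() instead.
        byte[] block = buildAndReport("64 MB array", () -> new byte[64 * 1024 * 1024]);
        System.out.println("allocated " + block.length + " bytes");
    }
}
```

Running the real lemmatizer with, say, java -Xmx512m and watching for an OutOfMemoryError is the more direct test; the probe only gives a rough idea of how much of that budget the models take.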

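Chris's answer regarding the Stanford Lemmatizer is great! Absolutely beautiful. He even included a pointer to the jar files, so I didn't have to google for it.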

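But as originally posted, one of his lines of code had a syntax error (he somehow switched the closing parenthesis and the semicolon at the end of the line that begins with "lemmas.add..."), and the post did not include the imports.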

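As for the NoSuchMethodError, it is usually caused by the method not being declared public static, but if you look at the code itself (at http://grepcode.com/file/repo1.maven.org/maven2/com.guokr/stan-cn-nlp/0.0.2/edu/stanford/nlp/util/Generics.java?av=h), that is not the problem here. I suspect the problem is somewhere in the build path, since this error is thrown when the class found at run time lacks a method that the calling code was compiled against. (I'm using Eclipse Kepler, so configuring the 33 jar files that I use in my project was no problem.)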
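
One way to test the build-path suspicion is to ask the JVM which jar a class was actually loaded from. The sketch below uses only JDK calls; ClassOrigin is a hypothetical helper, and the default class name is simply the Generics class from the URL above. If the printed location is an unexpected or older jar, a duplicate or mismatched jar on the classpath is a likely culprit.

```java
import java.net.URL;
import java.security.CodeSource;

// Hypothetical diagnostic helper: reports where the JVM loaded a class from.
class ClassOrigin {

    // Returns the jar or directory URL the class came from, or a marker
    // string for classes without a code source (e.g. JDK core classes).
    static String locate(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        URL location = source.getLocation();
        return location.toString();
    }

    // Looks the class up by name without running its static initializers.
    static String locate(String className) {
        try {
            Class<?> type = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            return locate(type);
        } catch (ClassNotFoundException e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Assumption: the class of interest is the one named in the stack trace.
        String name = args.length > 0 ? args[0] : "edu.stanford.nlp.util.Generics";
        System.out.println(name + " -> " + locate(name));
    }
}
```

Run it with the same classpath as the failing program, passing the class named in the NoSuchMethodError stack trace as the argument.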

Here is my minor correction to Chris's code, along with an example (my apologies to Evanescence for butchering their perfect lyrics):

import java.util.LinkedList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class StanfordLemmatizer {

    protected StanfordCoreNLP pipeline;

    public StanfordLemmatizer() {
        // Create StanfordCoreNLP object properties, with POS tagging
        // (required for lemmatization), and lemmatization
        Properties props;
        props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma");

        /*
         * This is a pipeline that takes in a string and returns various analyzed linguistic forms. 
         * The String is tokenized via a tokenizer (such as PTBTokenizerAnnotator), 
         * and then other sequence model style annotation can be used to add things like lemmas, 
         * POS tags, and named entities. These are returned as a list of CoreLabels. 
         * Other analysis components build and store parse trees, dependency graphs, etc. 
         * 
         * This class is designed to apply multiple Annotators to an Annotation. 
         * The idea is that you first build up the pipeline by adding Annotators, 
         * and then you take the objects you wish to annotate and pass them in and 
         * get in return a fully annotated object.
         * 
         *  StanfordCoreNLP loads a lot of models, so you probably
         *  only want to do this once per execution
         */
        this.pipeline = new StanfordCoreNLP(props);
    }

    public List<String> lemmatize(String documentText)
    {
        List<String> lemmas = new LinkedList<String>();
        // Create an empty Annotation just with the given text
        Annotation document = new Annotation(documentText);
        // run all Annotators on this text
        this.pipeline.annotate(document);
        // Iterate over all of the sentences found
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for(CoreMap sentence: sentences) {
            // Iterate over all tokens in a sentence
            for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
                // Retrieve and add the lemma for each word into the
                // list of lemmas
                lemmas.add(token.get(LemmaAnnotation.class));
            }
        }
        return lemmas;
    }


    public static void main(String[] args) {
        System.out.println("Starting Stanford Lemmatizer");
        String text = "How could you be seeing into my eyes like open doors? \n"+
                "You led me down into my core where I've became so numb \n"+
                "Without a soul my spirit's sleeping somewhere cold \n"+
                "Until you find it there and led it back home \n"+
                "You woke me up inside \n"+
                "Called my name and saved me from the dark \n"+
                "You have bidden my blood and it ran \n"+
                "Before I would become undone \n"+
                "You saved me from the nothing I've almost become \n"+
                "You were bringing me to life \n"+
                "Now that I knew what I'm without \n"+
                "You can've just left me \n"+
                "You breathed into me and made me real \n"+
                "Frozen inside without your touch \n"+
                "Without your love, darling \n"+
                "Only you are the life among the dead \n"+
                "I've been living a lie, there's nothing inside \n"+
                "You were bringing me to life.";
        StanfordLemmatizer slem = new StanfordLemmatizer();
        System.out.println(slem.lemmatize(text));
    }

}

Here are my results (I was very impressed; it sometimes recognized "'s" as "is", and did almost everything else perfectly):

Starting Stanford Lemmatizer
Adding annotator tokenize
Adding annotator ssplit
Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.7 sec].
Adding annotator lemma

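[how, could, you, be, see, into, my, eye, like, open, door, ?, you, lead, me, down, into, my, core, where, I, have, become, so, numb, without, a, soul, my, spirit, 's, sleep, somewhere, cold, until, you, find, it, there, and, lead, it, back, home, you, wake, me, up, inside, call, my, name, and, save, me, from, the, dark, you, have, bid, my, blood, and, it, run, before, I, would, become, undo, you, save, me, from, the, nothing, I, have, almost, become, you, be, bring, me, to, life, now, that, I, know, what, I, be, without, you, can, have, just, leave, me, you, breathe, into, me, and, make, me, real, frozen, inside, without, you, touch, without, you, love, ,, darling, only, you, be, the, life, among, the, dead, I, have, be, live, a, lie, ,, there, be, nothing, inside, you, be, bring, me, to, life, .]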
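
The list returned by lemmatize contains punctuation tokens such as "?" and "," alongside the words. If only word lemmas are wanted, a small post-processing step can drop them. This is a minimal sketch using only the JDK; LemmaFilter is a hypothetical helper, and the rule "keep tokens with at least one letter or digit" is an assumption that also keeps clitics such as "'s".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical post-processing helper for a list of lemmas.
class LemmaFilter {

    // True if the token contains at least one letter or digit.
    static boolean hasLetterOrDigit(String token) {
        for (int i = 0; i < token.length(); i++) {
            if (Character.isLetterOrDigit(token.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    // Returns a new list without null entries or pure-punctuation tokens.
    static List<String> dropPunctuation(List<String> lemmas) {
        List<String> kept = new ArrayList<>();
        for (String lemma : lemmas) {
            if (lemma != null && hasLetterOrDigit(lemma)) {
                kept.add(lemma);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("how", "could", "you", "?", "spirit", "'s", ",", "be");
        System.out.println(dropPunctuation(sample)); // prints [how, could, you, spirit, 's, be]
    }
}
```

With the class above, the call would be LemmaFilter.dropPunctuation(slem.lemmatize(text)).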
