Generating N-grams from a sentence
java
nlp

How can I generate the n-grams of a string such as:

String Input="This is my car."

I want to generate n-grams from this input:

Input Ngram size = 3

The output should be:

This
is
my
car

This is
is my
my car

This is my
is my car

Please suggest how to implement this in Java, or whether there is a library available for it.

I am trying to use NGramTokenizer, but it produces n-grams of characters, and I want n-grams of words.

Source: Stack Overflow
3 Answers

I believe this does what you want:

import java.util.*;

public class Test {

    // Returns every run of n consecutive words in str, in order of appearance.
    public static List<String> ngrams(int n, String str) {
        List<String> ngrams = new ArrayList<>();
        String[] words = str.split(" ");
        // The last n-gram starts at index words.length - n.
        for (int i = 0; i < words.length - n + 1; i++)
            ngrams.add(concat(words, i, i + n));
        return ngrams;
    }

    // Joins words[start] .. words[end - 1] with single spaces.
    public static String concat(String[] words, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++) {
            if (i > start) sb.append(' ');
            sb.append(words[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            for (String ngram : ngrams(n, "This is my car."))
                System.out.println(ngram);
            System.out.println();
        }
    }
}

Output:

This
is
my
car.

This is
is my
my car.

This is my
is my car.
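
Note that the output above keeps the period on "car.", while the question asks for "car". Here is a minimal sketch of one way to handle that, assuming it is acceptable to strip leading and trailing punctuation from each token. The class name CleanNgrams and the regex are my own choices for illustration, not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;

public class CleanNgrams {

    // Hypothetical helper: strips leading/trailing ASCII punctuation from each
    // token, then builds the word n-grams of size n.
    public static List<String> ngrams(int n, String text) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> words = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            // \p{Punct} matches ASCII punctuation; only the ends are trimmed.
            String word = token.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        List<String> result = new ArrayList<>();
        // i + n <= size keeps the window inside the word list.
        for (int i = 0; i + n <= words.size(); i++) {
            result.add(String.join(" ", words.subList(i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            System.out.println(ngrams(n, "This is my car."));
        }
    }
}
```

Punctuation inside a token (for example the apostrophe in "don't") is left alone, which may or may not be what you want for your data.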

An "on-demand" solution implemented as an Iterator:

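import java.util.Iterator;
import java.util.NoSuchElementException;

class NgramIterator implements Iterator<String> {

    private final String[] words;
    private final int n;
    private int pos = 0; // start index of the next n-gram

    public NgramIterator(int n, String str) {
        this.n = n;
        words = str.split(" ");
    }

    @Override
    public boolean hasNext() {
        return pos < words.length - n + 1;
    }

    @Override
    public String next() {
        // Honor the Iterator contract instead of failing with an index error.
        if (!hasNext())
            throw new NoSuchElementException();
        StringBuilder sb = new StringBuilder();
        for (int i = pos; i < pos + n; i++)
            sb.append((i > pos ? " " : "") + words[i]);
        pos++;
        return sb.toString();
    }

    @Override
    public void remove() {
        throw new UnsupportedOperationException();
    }
}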
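
If you are on Java 8 or later, the same on-demand idea can also be sketched with a Stream instead of a hand-written Iterator. This is only an alternative I am suggesting, not part of the answer above; the class name NgramStream is made up for the example.

```java
import java.util.Arrays;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class NgramStream {

    // Hypothetical helper: returns a lazily evaluated stream of word n-grams.
    public static Stream<String> ngrams(int n, String text) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        String[] words = text.trim().split("\\s+");
        // If n exceeds the word count the range is empty, so nothing is produced.
        return IntStream.rangeClosed(0, words.length - n)
                .mapToObj(i -> String.join(" ", Arrays.copyOfRange(words, i, i + n)));
    }

    public static void main(String[] args) {
        // Each n-gram is built only when the terminal operation asks for it.
        ngrams(2, "This is my car.").limit(2).forEach(System.out::println);
        System.out.println(ngrams(3, "This is my car.").collect(Collectors.toList()));
    }
}
```

Streams are lazy, so with limit(2) only the first two bigrams are ever joined. Like the iterator, this keeps punctuation attached to the words.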

This code returns an array of all word n-grams of the given length:

// Returns all word n-grams of size len, or an empty array if s has fewer than len words.
public static String[] ngrams(String s, int len) {
    String[] parts = s.split(" ");
    // Clamp to zero so that len > parts.length does not give a negative array size.
    String[] result = new String[Math.max(0, parts.length - len + 1)];
    for(int i = 0; i < result.length; i++) {
       StringBuilder sb = new StringBuilder();
       for(int k = 0; k < len; k++) {
           if(k > 0) sb.append(' ');
           sb.append(parts[i+k]);
       }
       result[i] = sb.toString();
    }
    return result;
}

For example:

System.out.println(Arrays.toString(ngrams("This is my car", 2)));
//--> [This is, is my, my car]
System.out.println(Arrays.toString(ngrams("This is my car", 3)));
//--> [This is my, is my car] 

You are looking for ShingleFilter.

Update: The link points to version 3.0.2. In later versions of Lucene, this class may be in a different package.
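
For context, a "shingle" is Lucene's term for a word n-gram. The sketch below is plain Java and does not use the Lucene API at all. It only illustrates the idea of emitting, for each word position, every n-gram between a minimum and a maximum size. The class name ShingleSketch and its parameters are hypothetical, and the output order is not guaranteed to match what ShingleFilter itself produces.

```java
import java.util.ArrayList;
import java.util.List;

public class ShingleSketch {

    // Hypothetical helper, not part of Lucene: for every start position,
    // emits the word n-grams of each size from minSize to maxSize.
    public static List<String> shingles(String text, int minSize, int maxSize) {
        if (minSize < 1 || maxSize < minSize) {
            throw new IllegalArgumentException("require 1 <= minSize <= maxSize");
        }
        String[] words = text.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            StringBuilder sb = new StringBuilder();
            // Grow the n-gram one word at a time, stopping at the end of the input.
            for (int size = 1; size <= maxSize && start + size <= words.length; size++) {
                if (size > 1) {
                    sb.append(' ');
                }
                sb.append(words[start + size - 1]);
                if (size >= minSize) {
                    result.add(sb.toString());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [This, This is, This is my, is, is my, is my car, my, my car, car]
        System.out.println(shingles("This is my car", 1, 3));
    }
}
```

Compared with the other answers, this produces all sizes in a single pass over the words, grouped by position rather than by n-gram size.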
