Efficiently counting word frequencies in Python
nlp
python
scikit-learn

I want to count the frequency of all the words in a text file.

>>> countInFile('test.txt')

It should return {'aaa': 1, 'bbb': 2, 'ccc': 1} if the target text file looks like this:

# test.txt
aaa bbb ccc
bbb

Following some posts, I have implemented it in pure Python. However, I found that a pure-Python approach is not enough given the huge file size (> 1 GB).

I think borrowing the power of sklearn is one candidate.

If I let CountVectorizer count frequencies for each line, I guess I would get word frequencies by summing up each column. But that sounds a bit indirect.
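
Roughly, I imagine something like this sketch (just an illustration of the column-sum idea, not a final solution):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
with open('test.txt') as f:
    X = vectorizer.fit_transform(f)   # one row per line of the file
# Column sums of the document-term matrix give per-word totals.
word_counts = dict(zip(vectorizer.get_feature_names(), X.sum(axis=0).A1))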

What is the most efficient and straightforward way to count words in a file with Python?

Update

My (very slow) code is here:

import string
from collections import Counter

def get_term_frequency_in_file(source_file_path):
    wordcount = {}
    with open(source_file_path) as f:
        for line in f:
            line = line.lower().translate(None, string.punctuation)  # Python 2 str.translate
            this_wordcount = Counter(line.split())
            wordcount = add_merge_two_dict(wordcount, this_wordcount)
    return wordcount

def add_merge_two_dict(x, y):
    return { k: x.get(k, 0) + y.get(k, 0) for k in set(x) | set(y) }
Source: Stack Overflow
4 Answers

The most succinct approach is to use the tools Python gives you.

from future_builtins import map  # Only on Python 2

from collections import Counter
from itertools import chain

def countInFile(filename):
    with open(filename) as f:
        return Counter(chain.from_iterable(map(str.split, f)))

That's it. map(str.split, f) makes a generator that yields a list of words from each line. Wrapping it in chain.from_iterable converts that into a single generator that produces one word at a time. Counter takes an iterable input and counts all the unique values in it. At the end, you return a dict-like object (a Counter) that stores all unique words and their counts, and during construction you only ever hold one line of data plus the running totals in memory, not the whole file at once.

In theory, on Python 2.7 and 3.1, you might do slightly better by looping over the chained results yourself and counting with a dict or collections.defaultdict(int) (because Counter is implemented in Python, which can make it slower in some cases), but letting Counter do the work is simpler and more self-documenting (I mean, the whole goal is counting, so use a Counter). Beyond that, on CPython (the reference interpreter) 3.2 and higher, Counter has a C-level accelerator for counting iterable inputs that will run faster than anything you could write in pure Python.
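
For reference, a minimal sketch of that dict/defaultdict(int) variant (my illustration, not part of the original answer) would be:

from collections import defaultdict
from itertools import chain

def countInFile(filename):
    counts = defaultdict(int)
    with open(filename) as f:
        # Loop over the chained words yourself instead of delegating to Counter.
        for word in chain.from_iterable(map(str.split, f)):
            counts[word] += 1
    return counts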

Update: You seem to want punctuation stripped and case-insensitive matching, so here is a variant of my earlier code that does that:

from string import punctuation

def countInFile(filename):
    with open(filename) as f:
        linewords = (line.translate(None, punctuation).lower().split() for line in f)
        return Counter(chain.from_iterable(linewords))
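
If you are on Python 3, str.translate no longer accepts None to delete characters; a roughly equivalent variant (my adaptation, assuming a Python 3 environment) builds the deletion table with str.maketrans:

from collections import Counter
from itertools import chain
from string import punctuation

def countInFile(filename):
    strip_punct = str.maketrans('', '', punctuation)  # translation table that deletes punctuation
    with open(filename) as f:
        linewords = (line.translate(strip_punct).lower().split() for line in f)
        return Counter(chain.from_iterable(linewords))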

Your code runs much slower because it is creating and destroying lots of small Counter and set objects, rather than .update-ing a single Counter once per line (which, while slightly slower than what I gave in the updated code block, would at least be similar in algorithmic scaling factor).
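
A minimal sketch of that "one Counter, .update per line" version of your function (punctuation handling left aside) might be:

from collections import Counter

def get_term_frequency_in_file(source_file_path):
    wordcount = Counter()
    with open(source_file_path) as f:
        for line in f:
            # One in-place update per line instead of merging fresh dicts/Counters.
            wordcount.update(line.lower().split())
    return wordcount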


A memory-efficient and accurate way is to make use of:

  • CountVectorizer in scikit-learn (for ngram extraction)
  • NLTK for word_tokenize
  • numpy matrix sum to collect the counts
  • collections.Counter for collecting the counts and vocabulary

An example:

import urllib.request
from collections import Counter

import numpy as np 

from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

# Our sample textfile.
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')


# Note that `ngram_range=(1, 1)` means we want to extract Unigrams, i.e. tokens.
ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
# X matrix where the row represents sentences and column is our one-hot vector for each token in our vocabulary
X = ngram_vectorizer.fit_transform(data.split('\n'))

# Vocabulary
vocab = list(ngram_vectorizer.get_feature_names())

# Column-wise sum of the X matrix.
# It's some crazy numpy syntax that looks horribly unpythonic
# For details, see http://stackoverflow.com/questions/3337301/numpy-matrix-to-array
# and http://stackoverflow.com/questions/13567345/how-to-calculate-the-sum-of-all-columns-of-a-2d-numpy-array-efficiently
counts = X.sum(axis=0).A1

freq_distribution = Counter(dict(zip(vocab, counts)))
print (freq_distribution.most_common(10))

[out]:

[(',', 32000),
 ('.', 17783),
 ('de', 11225),
 ('a', 7197),
 ('que', 5710),
 ('la', 4732),
 ('je', 4304),
 ('se', 4013),
 ('на', 3978),
 ('na', 3834)]

Essentially, you can also do this:

from collections import Counter
import numpy as np 
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

def freq_dist(data):
    """
    :param data: A string with sentences separated by '\n'
    :type data: str
    """
    ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
    X = ngram_vectorizer.fit_transform(data.split('\n'))
    vocab = list(ngram_vectorizer.get_feature_names())
    counts = X.sum(axis=0).A1
    return Counter(dict(zip(vocab, counts)))

Let's time it:

import time

start = time.time()
word_distribution = freq_dist(data)
print (time.time() - start)

[out]:

5.257147789001465

Note that CountVectorizer can also take a file object instead of a string, and there is no need to read the whole file into memory here. In code:

import io
from collections import Counter

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

infile = '/path/to/input.txt'

ngram_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1), min_df=1)

with io.open(infile, 'r', encoding='utf8') as fin:
    X = ngram_vectorizer.fit_transform(fin)
    vocab = ngram_vectorizer.get_feature_names()
    counts = X.sum(axis=0).A1
    freq_distribution = Counter(dict(zip(vocab, counts)))
    print (freq_distribution.most_common(10))

Here are some benchmarks. It looks strange, but the crudest code wins.

[code]:

from collections import Counter, defaultdict
import io, time

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

infile = '/path/to/file'

def extract_dictionary_sklearn(file_path):
    with io.open(file_path, 'r', encoding='utf8') as fin:
        ngram_vectorizer = CountVectorizer(analyzer='word')
        X = ngram_vectorizer.fit_transform(fin)
        vocab = ngram_vectorizer.get_feature_names()
        counts = X.sum(axis=0).A1
    return Counter(dict(zip(vocab, counts)))

def extract_dictionary_native(file_path):
    dictionary = Counter()
    with io.open(file_path, 'r', encoding='utf8') as fin:
        for line in fin:
            dictionary.update(line.split())
    return dictionary

def extract_dictionary_paddle(file_path):
    dictionary = defaultdict(int)
    with io.open(file_path, 'r', encoding='utf8') as fin:
        for line in fin:
            for word in line.split():
                dictionary[word] += 1
    return dictionary

start = time.time()
extract_dictionary_sklearn(infile)
print(time.time() - start)

start = time.time()
extract_dictionary_native(infile)
print(time.time() - start)

start = time.time()
extract_dictionary_paddle(infile)
print(time.time() - start)

[out]:

38.306814909
24.8241138458
12.1182529926

Data size (154 MB) used in the benchmarks above:

$ wc -c /path/to/file
161680851

$ wc -l /path/to/file
2176141

A few things to note:

  • With the sklearn version, there is the overhead of creating the vectorizer, the numpy operations, and the conversion into a Counter object
  • Then comes the native Counter update version; it seems Counter.update() is an expensive operation

This should be sufficient.

def countinfile(filename):
    d = {}
    with open(filename, "r") as fin:
        for line in fin:
            words = line.strip().split()
            for word in words:
                try:
                    d[word] += 1
                except KeyError:
                    d[word] = 1
    return d