An efficient and accurate way to do this in memory is with:
- CountVectorizer from scikit-learn (for ngram extraction)
- word_tokenize from NLTK
- a numpy matrix sum to collect the counts
- collections.Counter to collect the counts and the vocabulary
An example:
import urllib.request
from collections import Counter
import numpy as np
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
# Our sample text file.
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')
# Note that `ngram_range=(1, 1)` means we want to extract Unigrams, i.e. tokens.
ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
# X matrix where each row represents a sentence and each column holds the count of one token from our vocabulary
X = ngram_vectorizer.fit_transform(data.split('\n'))
# Vocabulary
vocab = list(ngram_vectorizer.get_feature_names())
# Column-wise sum of the X matrix.
# It's some crazy numpy syntax that looks horribly unpythonic
# For details, see http://stackoverflow.com/questions/3337301/numpy-matrix-to-array
# and http://stackoverflow.com/questions/13567345/how-to-calculate-the-sum-of-all-columns-of-a-2d-numpy-array-efficiently
counts = X.sum(axis=0).A1
freq_distribution = Counter(dict(zip(vocab, counts)))
print(freq_distribution.most_common(10))
[out]:
[(',', 32000),
('.', 17783),
('de', 11225),
('a', 7197),
('que', 5710),
('la', 4732),
('je', 4304),
('se', 4013),
('на', 3978),
('na', 3834)]
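A note on API drift: in scikit-learn 1.2 and later, get_feature_names() was removed in favor of get_feature_names_out(), and the .A1 shortcut works because X.sum(axis=0) returns a numpy.matrix. If the snippet above breaks on a recent install, this adaptation (a sketch against the newer API, reusing the same variables as above) should behave the same:
# scikit-learn >= 1.2: get_feature_names() no longer exists.
vocab = list(ngram_vectorizer.get_feature_names_out())
# Same result as .A1, without relying on the numpy.matrix type.
counts = np.asarray(X.sum(axis=0)).ravel()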
Essentially, you can also do the following:
from collections import Counter
import numpy as np
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
def freq_dist(data):
    """
    :param data: A string with sentences separated by '\n'
    :type data: str
    """
    ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
    X = ngram_vectorizer.fit_transform(data.split('\n'))
    vocab = list(ngram_vectorizer.get_feature_names())
    counts = X.sum(axis=0).A1
    return Counter(dict(zip(vocab, counts)))
Let's time it:
import time
start = time.time()
word_distribution = freq_dist(data)
print(time.time() - start)
[out]:
5.257147789001465
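That figure is simple wall-clock time from time.time(). For a steadier measurement you could use the timeit module instead (a sketch, assuming data has already been loaded as above):
import timeit
# Run freq_dist once and print the elapsed time in seconds.
print(timeit.timeit(lambda: freq_dist(data), number=1))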
Note that CountVectorizer can also take a file object instead of a string, and there is no need to read the whole file into memory here. In code:
import io
from collections import Counter
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
infile = '/path/to/input.txt'
ngram_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1), min_df=1)
with io.open(infile, 'r', encoding='utf8') as fin:
    X = ngram_vectorizer.fit_transform(fin)

vocab = ngram_vectorizer.get_feature_names()
counts = X.sum(axis=0).A1
freq_distribution = Counter(dict(zip(vocab, counts)))
print(freq_distribution.most_common(10))
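This works because CountVectorizer's default input='content' simply iterates over whatever it is given, and a file object yields one line at a time; each line becomes a document, so only the current line, the vocabulary, and the sparse count matrix need to be held in memory.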
I want to count the frequency of all the words in a text file.
If the target text file contains, for instance, aaa bbb ccc bbb, then it should return {'aaa': 1, 'bbb': 2, 'ccc': 1}
. Following some posts, I have already implemented this in pure Python. However, I found that a pure-Python approach is insufficient because of the huge file size (> 1 GB).
I think bringing in sklearn's capabilities is a candidate.
If I let CountVectorizer count the frequencies for each line, I guess I would get the word frequencies by summing up each column. But that sounds a little indirect.
What is the most efficient and most straightforward way to count the words in a file with Python?
Update
My (very slow) code is here:
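A naive pure-Python baseline of the kind described above, reading line by line and tallying into a plain dict, might look like this hypothetical reconstruction (not the asker's actual code; the path is a placeholder):
word_freq = {}
with open('/path/to/input.txt', encoding='utf8') as fin:
    for line in fin:
        for token in line.split():
            # Tally each whitespace-separated token.
            word_freq[token] = word_freq.get(token, 0) + 1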