fuzzyset is much faster than fuzzywuzzy (difflib) for both indexing and searching.
from fuzzyset import FuzzySet
corpus = """It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines
It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines
I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. It was a murky and tempestuous night.
It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats."""
corpus = [line.lstrip() for line in corpus.split("\n")]
fs = FuzzySet(corpus)
query = "It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats."
fs.get(query)
# [(0.873015873015873, 'It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines')]
Warning: be careful not to mix unicode and bytes in your FuzzySet.
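One way to act on this warning is to normalize every line to `str` before it goes into the set. The following is a minimal sketch, assuming the input is UTF-8 encoded; `to_text` is a hypothetical helper written for this illustration and is not part of fuzzyset.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes.

    The UTF-8 default is an assumption; use the encoding of your data.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# A mixed list such as one built partly from a file opened in binary mode.
raw_lines = [b"It was a dark and stormy night.", "I had three cats."]

# After normalization every entry is str, so nothing is mixed.
clean_lines = [to_text(line) for line in raw_lines]
```

The normalized list can then be passed to `FuzzySet` in the same way as `corpus` above.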
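What I'm trying to accomplish is a program that reads in a file and compares each sentence against an original sentence. A sentence that matches the original exactly should receive a score of 1, and a sentence that is the total opposite should receive a score of 0. All other fuzzy sentences should receive a score between 0 and 1.
I'm not sure which operation to use to do this in Python 3.
I have included sample text below, where Text 1 is the original and the other strings are the texts to compare against it.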
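Text: Sample
Text 1: It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats.
Text 20: It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines // Should score high, but not 1
Text 21: It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines // Should score lower than Text 20
Text 22: I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. It was a murky and tempestuous night. // Should score lower than Text 21, but not below 0
Text 24: It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats. // Should score 0!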
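As a possible starting point, here is a minimal sketch that scores each comparison text against the original on a 0 to 1 scale using `difflib.SequenceMatcher` from the standard library. It swaps in difflib rather than fuzzyset so that it runs without extra packages; the function name `similarity` and the `candidates` dictionary are introduced here purely for illustration.

```python
from difflib import SequenceMatcher


def similarity(original: str, candidate: str) -> float:
    """Return a score in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, original, candidate).ratio()


original = (
    "It was a dark and stormy night. I was all alone sitting on a red chair. "
    "I was not completely alone as I had three cats."
)

candidates = {
    "Text 20": "It was a murky and stormy night. I was all alone sitting on a "
               "crimson chair. I was not completely alone as I had three felines",
    "Text 21": "It was a murky and tempestuous night. I was all alone sitting on a "
               "crimson cathedra. I was not completely alone as I had three felines",
    "Text 22": "I was all alone sitting on a crimson cathedra. I was not completely "
               "alone as I had three felines. It was a murky and tempestuous night.",
    "Text 24": "It was a dark and stormy night. I was not alone. I was not sitting "
               "on a red chair. I had three cats.",
}

# Score every candidate against the original sentence.
scores = {name: similarity(original, text) for name, text in candidates.items()}

# Print the candidates from most to least similar.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Note that this measures character overlap, not meaning. Text 24 shares long stretches of text with the original, so it receives a score above 0 even though its meaning is the opposite. Getting a 0 for negated sentences would need some form of semantic comparison, which neither difflib nor a purely string-based matcher provides.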