Fuzzy string comparison
Tags: nlp, python

What I am trying to accomplish is a program that reads in a file and compares each sentence against an original sentence. A sentence that is a perfect match to the original gets a score of 1, and a sentence that is its total opposite gets a 0. All other fuzzy sentences receive a score between 1 and 0.

I am unsure which operation to use to accomplish this in Python 3.

I have included sample text in which Text 1 is the original, and the other strings that follow are the comparisons.

Sample texts:

Text 1: It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats.

Text 20: It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines. // should score high, but not 1

Text 21: It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. // should score lower than Text 20

Text 22: I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. It was a murky and tempestuous night. // should score lower than Text 21, but not 0

Text 24: It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats. // should score a 0!

Source: Stack Overflow
3 Answers

fuzzyset is much faster than fuzzywuzzy (difflib) for both indexing and searching.

from fuzzyset import FuzzySet

# Index the comparison sentences, one per line.
corpus = """It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines
    It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines
    I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. It was a murky and tempestuous night.
    It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats."""
corpus = [line.lstrip() for line in corpus.split("\n")]
fs = FuzzySet(corpus)

# Query with the original sentence; get() returns (score, match) pairs
# for the closest entries in the set.
query = "It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats."
fs.get(query)
# [(0.873015873015873, 'It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines')]

Warning: take care not to mix unicode and bytes in a fuzzyset.


There is a package called fuzzywuzzy. Install it via pip:

pip install fuzzywuzzy

Simple usage:

>>> from fuzzywuzzy import fuzz
>>> fuzz.ratio("this is a test", "this is a test!")
    96

The package is built on top of difflib. You might ask why not just use that? Apart from being simpler, it has a number of different matching methods (like token order insensitivity and partial string matching) which make it more powerful in practice. The process.extract function is especially useful: it finds the best matching strings and ratios from a set. From their README:

Partial Ratio

>>> fuzz.partial_ratio("this is a test", "this is a test!")
    100

Token Sort Ratio

>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    90
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    100

Token Set Ratio

>>> fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
    84
>>> fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
    100

Process

>>> from fuzzywuzzy import process
>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> process.extract("new york jets", choices, limit=2)
    [('New York Jets', 100), ('New York Giants', 78)]
>>> process.extractOne("cowboys", choices)
    ("Dallas Cowboys", 90)

There is a module in the standard library (called difflib) that can compare strings and return a score based on their similarity. The SequenceMatcher class should do what you are looking for.

Edit: a small example from the Python prompt:

>>> from difflib import SequenceMatcher as SM
>>> s1 = ' It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats.'
>>> s2 = ' It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines.'
>>> SM(None, s1, s2).ratio()
0.9112903225806451
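To get a 0-to-1 score for every comparison sentence against the original, as the question asks, you can apply SequenceMatcher.ratio() to each one in a loop. A minimal sketch using the sample texts above (the "Text 20/21/22/24" names are just labels for illustration); note that character-level similarity cannot detect negation, so Text 24 will still score well above 0:

```python
from difflib import SequenceMatcher

original = "It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats."

comparisons = {
    "Text 20": "It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines.",
    "Text 21": "It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines.",
    "Text 22": "I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. It was a murky and tempestuous night.",
    "Text 24": "It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats.",
}

# ratio() already returns a float in [0, 1], so no extra normalization is needed.
scores = {name: SequenceMatcher(None, original, text).ratio()
          for name, text in comparisons.items()}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

As expected, Text 20 (fewer substitutions) scores higher than Text 21, and the reordered Text 22 drops further because SequenceMatcher rewards long contiguous matching blocks.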

HTH!
