使用nltk改进对人名的提取[关闭]
nlp
nltk
python
6
0

我正在尝试从文本中提取人名。

有人有推荐的方法吗?

这是我尝试过的(下面的代码):我正在使用nltk查找标记nltk所有内容,然后生成该人的所有NNP部分的列表。我正在跳过只有一个NNP可以避免抓住一个孤独姓氏的人。

我得到了不错的结果,但是想知道是否有更好的方法来解决这个问题。

码:

import nltk
from nameparser.parser import HumanName

def get_human_names(text):
    tokens = nltk.tokenize.word_tokenize(text)
    pos = nltk.pos_tag(tokens)
    sentt = nltk.ne_chunk(pos, binary = False)
    person_list = []
    person = []
    name = ""
    for subtree in sentt.subtrees(filter=lambda t: t.node == 'PERSON'):
        for leaf in subtree.leaves():
            person.append(leaf[0])
        if len(person) > 1: #avoid grabbing lone surnames
            for part in person:
                name += part + ' '
            if name[:-1] not in person_list:
                person_list.append(name[:-1])
            name = ''
        person = []

    return (person_list)

text = """
Some economists have responded positively to Bitcoin, including 
Francois R. Velde, senior economist of the Federal Reserve in Chicago 
who described it as "an elegant solution to the problem of creating a 
digital currency." In November 2013 Richard Branson announced that 
Virgin Galactic would accept Bitcoin as payment, saying that he had invested 
in Bitcoin and found it "fascinating how a whole new global currency 
has been created", encouraging others to also invest in Bitcoin.
Other economists commenting on Bitcoin have been critical. 
Economist Paul Krugman has suggested that the structure of the currency 
incentivizes hoarding and that its value derives from the expectation that 
others will accept it as payment. Economist Larry Summers has expressed 
a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market 
strategist for ConvergEx Group, has remarked on the effect of increasing 
use of Bitcoin and its restricted supply, noting, "When incremental 
adoption meets relatively fixed supply, it should be no surprise that 
prices go up. And that’s exactly what is happening to BTC prices."
"""

names = get_human_names(text)
print "LAST, FIRST"
for name in names: 
    last_first = HumanName(name).last + ', ' + HumanName(name).first
        print last_first

输出:

LAST, FIRST
Velde, Francois
Branson, Richard
Galactic, Virgin
Krugman, Paul
Summers, Larry
Colas, Nick

除了维珍银河,这都是有效的输出。当然,在本文中了解维珍银河不是人的名字是很困难的(也许是不可能的)部分。

参考资料:
Stack Overflow
收藏
评论
共 6 个回答
高赞 时间 活跃

我想在这里发布一个残酷和贪婪的解决方案,以解决@Enthusiast提出的问题:如果可能,请获取一个人的全名。

每个名称中第一个字符的大写用作识别Spacy PERSON的标准。例如,“吉姆·霍夫曼”本身不会被识别为命名实体,而“吉姆·霍夫曼”会被识别为命名实体。

因此,如果我们的任务只是从脚本中挑选人员,那么我们可以简单地首先将每个单词的首字母大写,然后将其转储为spacy

import spacy

def capitalizeWords(text):

  newText = ''

  for sentence in text.split('.'):
    newSentence = ''
    for word in sentence.split():
      newSentence += word+' '
    newText += newSentence+'\n'

  return newText

nlp = spacy.load('en_core_web_md')

doc = nlp(capitalizeWords(rawText))

#......

请注意,此方法以增加误报为代价覆盖了全名。

收藏
评论

我实际上只想提取人名,因此,我想根据wordnet(一个大型的英语词汇数据库)检查输出的所有姓名。有关Wordnet的更多信息,请参见: http : //www.nltk.org/howto/wordnet.html

import nltk
from nameparser.parser import HumanName
from nltk.corpus import wordnet


person_list = []
person_names=person_list
def get_human_names(text):
    tokens = nltk.tokenize.word_tokenize(text)
    pos = nltk.pos_tag(tokens)
    sentt = nltk.ne_chunk(pos, binary = False)

    person = []
    name = ""
    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
        for leaf in subtree.leaves():
            person.append(leaf[0])
        if len(person) > 1: #avoid grabbing lone surnames
            for part in person:
                name += part + ' '
            if name[:-1] not in person_list:
                person_list.append(name[:-1])
            name = ''
        person = []
#     print (person_list)

text = """

Some economists have responded positively to Bitcoin, including 
Francois R. Velde, senior economist of the Federal Reserve in Chicago 
who described it as "an elegant solution to the problem of creating a 
digital currency." In November 2013 Richard Branson announced that 
Virgin Galactic would accept Bitcoin as payment, saying that he had invested 
in Bitcoin and found it "fascinating how a whole new global currency 
has been created", encouraging others to also invest in Bitcoin.
Other economists commenting on Bitcoin have been critical. 
Economist Paul Krugman has suggested that the structure of the currency 
incentivizes hoarding and that its value derives from the expectation that 
others will accept it as payment. Economist Larry Summers has expressed 
a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market 
strategist for ConvergEx Group, has remarked on the effect of increasing 
use of Bitcoin and its restricted supply, noting, "When incremental 
adoption meets relatively fixed supply, it should be no surprise that 
prices go up. And that’s exactly what is happening to BTC prices."
"""

names = get_human_names(text)
for person in person_list:
    person_split = person.split(" ")
    for name in person_split:
        if wordnet.synsets(name):
            if(name in person):
                person_names.remove(person)
                break

print(person_names)

输出值

['Francois R. Velde', 'Richard Branson', 'Economist Paul Krugman', 'Nick Colas']

除了拉里·萨默斯(Larry Summers)之外,所有名称都是正确的,这是由于姓氏“ Summers”。

收藏
评论

您可以尝试对找到的名称进行解析,并检查是否可以在诸如freebase.com之类的数据库中找到它们。在本地获取数据并进行查询(在RDF中),或使用Google的api: https : //developers.google.com/freebase/v1/getting-started 。然后,可以根据Freebase数据丢弃大多数大公司,地理位置等(您的代码段会捕获这些信息)。

收藏
评论

对于寻找的其他人,我发现本文很有用: http : //timmcnamara.co.nz/post/2650550090/extracting-names-with-6-lines-of-python-code

>>> import nltk
>>> def extract_entities(text):
...     for sent in nltk.sent_tokenize(text):
...         for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
...             if hasattr(chunk, 'node'):
...                 print chunk.node, ' '.join(c[0] for c in chunk.leaves())
...
收藏
评论

@trojane的答案对我来说不是很有效,但是对于我来说却很有帮助。

先决条件

创建一个文件夹stanford-ner并下载以下两个文件:

脚本

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import nltk
from nltk.tag.stanford import StanfordNERTagger

text = u"""
Some economists have responded positively to Bitcoin, including
Francois R. Velde, senior economist of the Federal Reserve in Chicago
who described it as "an elegant solution to the problem of creating a
digital currency." In November 2013 Richard Branson announced that
Virgin Galactic would accept Bitcoin as payment, saying that he had invested
in Bitcoin and found it "fascinating how a whole new global currency
has been created", encouraging others to also invest in Bitcoin.
Other economists commenting on Bitcoin have been critical.
Economist Paul Krugman has suggested that the structure of the currency
incentivizes hoarding and that its value derives from the expectation that
others will accept it as payment. Economist Larry Summers has expressed
a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market
strategist for ConvergEx Group, has remarked on the effect of increasing
use of Bitcoin and its restricted supply, noting, "When incremental
adoption meets relatively fixed supply, it should be no surprise that
prices go up. And that’s exactly what is happening to BTC prices.
"""

st = StanfordNERTagger('stanford-ner/english.all.3class.distsim.crf.ser.gz',
                       'stanford-ner/stanford-ner.jar')

for sent in nltk.sent_tokenize(text):
    tokens = nltk.tokenize.word_tokenize(sent)
    tags = st.tag(tokens)
    for tag in tags:
        if tag[1] in ["PERSON", "LOCATION", "ORGANIZATION"]:
            print(tag)

结果

(u'Bitcoin', u'LOCATION')       # wrong
(u'Francois', u'PERSON')
(u'R.', u'PERSON')
(u'Velde', u'PERSON')
(u'Federal', u'ORGANIZATION')
(u'Reserve', u'ORGANIZATION')
(u'Chicago', u'LOCATION')
(u'Richard', u'PERSON')
(u'Branson', u'PERSON')
(u'Virgin', u'PERSON')         # Wrong
(u'Galactic', u'PERSON')       # Wrong
(u'Bitcoin', u'PERSON')        # Wrong
(u'Bitcoin', u'LOCATION')      # Wrong
(u'Bitcoin', u'LOCATION')      # Wrong
(u'Paul', u'PERSON')
(u'Krugman', u'PERSON')
(u'Larry', u'PERSON')
(u'Summers', u'PERSON')
(u'Bitcoin', u'PERSON')        # Wrong
(u'Nick', u'PERSON')
(u'Colas', u'PERSON')
(u'ConvergEx', u'ORGANIZATION')
(u'Group', u'ORGANIZATION')     
(u'Bitcoin', u'LOCATION')       # Wrong
(u'BTC', u'ORGANIZATION')       # Wrong
收藏
评论

必须同意“使我的代码更好”这个网站不太适合的建议,但是我可以为您提供一些尝试的途径。

看看斯坦福命名实体识别器(NER) 。它的绑定已包含在NLTK v 2.0中,但是您必须下载一些核心文件。这是可以为您完成所有操作的脚本

我写了这个脚本:

import nltk
from nltk.tag.stanford import NERTagger
st = NERTagger('stanford-ner/all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
text = """YOUR TEXT GOES HERE"""

for sent in nltk.sent_tokenize(text):
    tokens = nltk.tokenize.word_tokenize(sent)
    tags = st.tag(tokens)
    for tag in tags:
        if tag[1]=='PERSON': print tag

并没有那么糟糕的输出:

('Francois','PERSON')('R.','PERSON')('Velde','PERSON')('Richard','PERSON')('Branson','PERSON')('Virgin' ,'PERSON')('Galactic','PERSON')('Bitcoin','PERSON')('Bitcoin','PERSON')('Paul','PERSON')('Krugman','PERSON') ('Larry','PERSON')('Summers','PERSON')('Bitcoin','PERSON')('Nick','PERSON')('Colas','PERSON')

希望这会有所帮助。

收藏
评论
新手导航
  • 社区规范
  • 提出问题
  • 进行投票
  • 个人资料
  • 优化问题
  • 回答问题

关于我们

常见问题

内容许可

联系我们

@2020 AskGo
京ICP备20001863号