@trojane's answer did not quite work for me, but it helped me a lot.
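Prerequisites
Create a folder named stanford-ner and download the following two files into it (these are the names the script below loads):
english.all.3class.distsim.crf.ser.gz
stanford-ner.jar
Script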
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import nltk
from nltk.tag.stanford import StanfordNERTagger
text = u"""
Some economists have responded positively to Bitcoin, including
Francois R. Velde, senior economist of the Federal Reserve in Chicago
who described it as "an elegant solution to the problem of creating a
digital currency." In November 2013 Richard Branson announced that
Virgin Galactic would accept Bitcoin as payment, saying that he had invested
in Bitcoin and found it "fascinating how a whole new global currency
has been created", encouraging others to also invest in Bitcoin.
Other economists commenting on Bitcoin have been critical.
Economist Paul Krugman has suggested that the structure of the currency
incentivizes hoarding and that its value derives from the expectation that
others will accept it as payment. Economist Larry Summers has expressed
a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market
strategist for ConvergEx Group, has remarked on the effect of increasing
use of Bitcoin and its restricted supply, noting, "When incremental
adoption meets relatively fixed supply, it should be no surprise that
prices go up. And that’s exactly what is happening to BTC prices.
"""
# Paths are relative to the working directory: the model first, then the jar
st = StanfordNERTagger('stanford-ner/english.all.3class.distsim.crf.ser.gz',
                       'stanford-ner/stanford-ner.jar')

for sent in nltk.sent_tokenize(text):
    tokens = nltk.tokenize.word_tokenize(sent)
    tags = st.tag(tokens)
    # Print only the tokens tagged as one of the three entity classes
    for tag in tags:
        if tag[1] in ["PERSON", "LOCATION", "ORGANIZATION"]:
            print(tag)
Result
(u'Bitcoin', u'LOCATION') # Wrong
(u'Francois', u'PERSON')
(u'R.', u'PERSON')
(u'Velde', u'PERSON')
(u'Federal', u'ORGANIZATION')
(u'Reserve', u'ORGANIZATION')
(u'Chicago', u'LOCATION')
(u'Richard', u'PERSON')
(u'Branson', u'PERSON')
(u'Virgin', u'PERSON') # Wrong
(u'Galactic', u'PERSON') # Wrong
(u'Bitcoin', u'PERSON') # Wrong
(u'Bitcoin', u'LOCATION') # Wrong
(u'Bitcoin', u'LOCATION') # Wrong
(u'Paul', u'PERSON')
(u'Krugman', u'PERSON')
(u'Larry', u'PERSON')
(u'Summers', u'PERSON')
(u'Bitcoin', u'PERSON') # Wrong
(u'Nick', u'PERSON')
(u'Colas', u'PERSON')
(u'ConvergEx', u'ORGANIZATION')
(u'Group', u'ORGANIZATION')
(u'Bitcoin', u'LOCATION') # Wrong
(u'BTC', u'ORGANIZATION') # Wrong
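The tagger labels each token separately, so a full name such as "Francois R. Velde" comes back as three tuples. A minimal sketch of one way to merge adjacent tokens that share a tag is shown here. The helper name group_entities and the sample list are hypothetical, and the sample assumes that tokens outside any entity carry the tag "O".

```python
from itertools import groupby


def group_entities(tagged_tokens, wanted=("PERSON", "LOCATION", "ORGANIZATION")):
    """Merge runs of adjacent tokens that share the same entity tag."""
    entities = []
    # groupby yields one group per run of consecutive pairs with an equal tag
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag in wanted:
            entities.append((" ".join(token for token, _ in run), tag))
    return entities


# Hypothetical tagger output for part of the first sentence
sample = [
    ("Francois", "PERSON"), ("R.", "PERSON"), ("Velde", "PERSON"),
    (",", "O"), ("senior", "O"), ("economist", "O"), ("of", "O"), ("the", "O"),
    ("Federal", "ORGANIZATION"), ("Reserve", "ORGANIZATION"),
    ("in", "O"), ("Chicago", "LOCATION"),
]
print(group_entities(sample))
```

One caveat of this sketch: two different entities of the same class that sit directly next to each other would be merged into one. It also does nothing about the mislabelled Bitcoin tokens, which would still need separate filtering.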
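I am trying to extract person names from text.
Does anyone have a method they would recommend?
Here is what I tried (code below): I am using nltk to find everything tagged as a person and then building a list of all the NNP parts of that person. I skip persons that have only one NNP, which avoids grabbing a lone surname. I am getting decent results, but I wonder whether there is a better way to approach this problem.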
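Code:
Output:
All of this is valid output except for Virgin Galactic. Of course, knowing that Virgin Galactic is not a person's name in the context of this article is the hard (maybe impossible) part.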
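For readers who want to try the approach described in the question (take the PERSON chunks, keep their NNP parts, skip single-NNP persons), a rough sketch might look like the following. It is an illustration only and not the original code from the question. The function names extract_person_names and person_names_in and the min_parts parameter are hypothetical, and the nltk part assumes the tokenizer, POS tagger and NE chunker data have already been downloaded.

```python
def extract_person_names(chunked_sentence, min_parts=2):
    """Collect multi-word PERSON names from one ne_chunk-style tree."""
    names = []
    for node in chunked_sentence:
        # Tokens outside any entity are plain (word, pos) tuples without label()
        if not hasattr(node, "label") or node.label() != "PERSON":
            continue
        # Keep only the proper-noun parts of the chunk
        parts = [word for word, pos in node.leaves() if pos == "NNP"]
        # Skip persons with too few NNP parts, e.g. a lone surname
        if len(parts) >= min_parts:
            names.append(" ".join(parts))
    return names


def person_names_in(text):
    """Run nltk tokenizing, POS tagging and NE chunking over raw text."""
    import nltk  # imported here so the helper above works without nltk

    names = []
    for sent in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sent))
        names.extend(extract_person_names(nltk.ne_chunk(tagged)))
    return names
```

Calling person_names_in(text) on the article text would return a list of name strings. As noted above, a chunker that labels Virgin Galactic as a person would still let it through, since this filter only looks at the tag and the number of NNP parts.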