How to print the LDA topic model from gensim? Python
gensim
nlp
python

Using gensim I was able to extract topics from a set of documents with LSA, but how do I access the topics generated by the LDA model?

When printing lda.print_topics(10), the code gave the following error because print_topics() returns a NoneType:

Traceback (most recent call last):
  File "/home/alvas/workspace/XLINGTOP/xlingtop.py", line 93, in <module>
    for top in lda.print_topics(2):
TypeError: 'NoneType' object is not iterable

The code:

from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from itertools import izip

documents = ["Human machine interface for lab abc computer applications",
              "A survey of user opinion of computer system response time",
              "The EPS user interface management system",
              "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey"]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once]
         for text in texts]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# I can print out the topics for LSA
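tfidf = models.TfidfModel(corpus)  # corpus_tfidf was used below but never defined
corpus_tfidf = tfidf[corpus]
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)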
corpus_lsi = lsi[corpus]

for l,t in izip(corpus_lsi,corpus):
  print l,"#",t
print
for top in lsi.print_topics(2):
  print top

# I can print out the documents and which is the most probable topics for each doc.
lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=50)
corpus_lda = lda[corpus]

for l,t in izip(corpus_lda,corpus):
  print l,"#",t
print

# But I am unable to print out the topics, how should i do it?
for top in lda.print_topics(10):
  print top

Source: Stack Overflow
6 Answers

Here is sample code for printing topics:

import pickle
from gensim import corpora, models

# Python 2 syntax, as in the question.
def ExtractTopics(filename, numTopics=5):
    # filename is a pickle file holding a list of lists, each a bag of words
    with open(filename, "rb") as f:
        texts = pickle.load(f)

    # generate dictionary (named so it does not shadow the built-in dict)
    dictionary = corpora.Dictionary(texts)

    # remove words with low freq.  3 is an arbitrary number I have picked here
    low_occurrence_ids = [tokenid for tokenid, docfreq in dictionary.dfs.iteritems() if docfreq <= 3]
    dictionary.filter_tokens(low_occurrence_ids)
    dictionary.compactify()
    corpus = [dictionary.doc2bow(t) for t in texts]
    # Generate LDA Model (no id2word is passed, so topics refer to token ids)
    lda = models.ldamodel.LdaModel(corpus, num_topics=numTopics)
    i = 0
    # We print the topics, mapping each token id back to its word
    for topic in lda.show_topics(num_topics=numTopics, formatted=False, topn=20):
        i = i + 1
        print "Topic #" + str(i) + ":",
        for p, token_id in topic:
            print dictionary[int(token_id)],

        print ""

Are you using any logging? print_topics prints to the log file, as stated in the docs.

As @mac389 says, lda.show_topics() is the way to go to print to the screen.


I think the syntax of show_topics has changed over time:

show_topics(num_topics=10, num_words=10, log=False, formatted=True)

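For num_topics number of topics, return the num_words most significant words (10 words per topic, by default).

The topics are returned as a list: a list of strings if formatted is True, or a list of (probability, word) 2-tuples if False.

If log is True, also output this result to the log.

Unlike LSA, there is no natural ordering between the topics in LDA. The returned num_topics <= self.num_topics subset of all topics is therefore arbitrary and may change between two LDA training runs.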
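
To make the two return formats concrete, here is a small standalone sketch that turns the unformatted form into a formatted string. It assumes the (probability, word) tuple order quoted above and a prob*word + prob*word layout for the string form. Newer gensim releases appear to put the word first, so check the order with your installed version. format_topic is a hypothetical helper, not a gensim function.

```python
def format_topic(pairs):
    """Join (probability, word) pairs into 'p*word + p*word' form."""
    return " + ".join("{:.3f}*{}".format(prob, word) for prob, word in pairs)

# Hand-written sample in the formatted=False shape (illustrative values).
unformatted = [(0.083, "response"), (0.083, "interface"), (0.083, "time")]
formatted = format_topic(unformatted)
print(formatted)
```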


After some messing around, it seems that print_topics(numoftopics) for the ldamodel has a bug. So my workaround is to use print_topic(topicid):

>>> print lda.print_topics()
None
>>> for i in range(lda.num_topics):
...     print lda.print_topic(i)
...
0.083*response + 0.083*interface + 0.083*time + 0.083*human + 0.083*user + 0.083*survey + 0.083*computer + 0.083*eps + 0.083*trees + 0.083*system
...
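
If you need the numbers rather than the printed string, the prob*word + prob*word format shown above is easy to split apart. A minimal standalone sketch follows; parse_topic is a hypothetical helper, not part of gensim. It assumes the layout of the output above, and it also strips double quotes around words in case your gensim version adds them.

```python
def parse_topic(topic_string):
    """Parse '0.083*response + 0.083*time' into [(probability, word), ...]."""
    pairs = []
    for term in topic_string.split(" + "):
        prob, word = term.split("*", 1)
        pairs.append((float(prob), word.strip('"')))
    return pairs

line = "0.083*response + 0.083*interface + 0.083*time"
parsed = parse_topic(line)
print(parsed)
```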

You can use:

for i in  lda_model.show_topics():
    print i[0], i[1]

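I think it is always more helpful to see the topics as a list of words. The following code snippet helps achieve that goal. I assume you already have an LDA model called lda_model.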

for idx, topic in lda_model.show_topics(formatted=False, num_words=30):
    print('Topic: {} \nWords: {}'.format(idx, [w[0] for w in topic]))

In the above code, I have decided to show the first 30 words belonging to each topic. For simplicity, I show only the first two topics I get.

Topic: 0 
Words: ['associate', 'incident', 'time', 'task', 'pain', 'amcare', 'work', 'ppe', 'train', 'proper', 'report', 'standard', 'pmv', 'level', 'perform', 'wear', 'date', 'factor', 'overtime', 'location', 'area', 'yes', 'new', 'treatment', 'start', 'stretch', 'assign', 'condition', 'participate', 'environmental']
Topic: 1 
Words: ['work', 'associate', 'cage', 'aid', 'shift', 'leave', 'area', 'eye', 'incident', 'aider', 'hit', 'pit', 'manager', 'return', 'start', 'continue', 'pick', 'call', 'come', 'right', 'take', 'report', 'lead', 'break', 'paramedic', 'receive', 'get', 'inform', 'room', 'head']

I don't really like the way the above topics look, so I usually modify my code as shown:

for idx, topic in lda_model.show_topics(formatted=False, num_words= 30):
    print('Topic: {} \nWords: {}'.format(idx, '|'.join([w[0] for w in topic])))

... and the output (first two topics shown) will look like this:

Topic: 0 
Words: associate|incident|time|task|pain|amcare|work|ppe|train|proper|report|standard|pmv|level|perform|wear|date|factor|overtime|location|area|yes|new|treatment|start|stretch|assign|condition|participate|environmental
Topic: 1 
Words: work|associate|cage|aid|shift|leave|area|eye|incident|aider|hit|pit|manager|return|start|continue|pick|call|come|right|take|report|lead|break|paramedic|receive|get|inform|room|head