LDA Topic Model


1 Problem Description

To demonstrate the LDA topic model, we split the Government Work Report used in the word-cloud analysis into three txt files stored in the report folder (in practical text mining you usually need to read multiple documents). This example walks through the LDA workflow on those files.

2 Import Required Modules

Note: pyLDAvis works best at version 2.1.2, i.e. pip install pyLDAvis==2.1.2

import pyLDAvis.gensim
import pyLDAvis
import jieba
from gensim.models.ldamodel import LdaModel
from gensim.corpora.dictionary import Dictionary

3 Read Documents from Local Files

# Read-only mode ('r') is enough here; with-blocks close the files automatically
with open('./report/回顾.txt', 'r', encoding='gbk') as f:
    report1 = f.read()
with open('./report/任务.txt', 'r', encoding='gbk') as f:
    report2 = f.read()
with open('./report/重点.txt', 'r', encoding='gbk') as f:
    report3 = f.read()

documents=[report1,report2,report3]

4 Tokenize and Remove Stopwords

# Load the stopword list (one word per line)
with open('百度停词表.txt', 'r', encoding='utf-8') as f:
    stopwords = [line.strip() for line in f]

doc_words = []
for document in documents:
    # Tokenize with jieba, then drop stopwords and layout characters
    words = [word for word in jieba.lcut(document.strip())
             if word not in stopwords and word not in ['—', '\n']]
    doc_words.append(words)
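The filtering step above can be checked without jieba by hand-tokenizing a toy document; the tokens and stopword list below are invented for illustration:

```python
# Toy illustration of stopword filtering on a pre-tokenized document.
# The token list and stopwords here are made up for the example.
tokens = ['经济', '的', '发展', '和', '改革', '\n', '—', '创新']
stopwords = {'的', '和'}

filtered = [t for t in tokens if t not in stopwords and t not in ['—', '\n']]
print(filtered)  # ['经济', '发展', '改革', '创新']
```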

5 Build the Dictionary and Corpus, and Train the LDA Model

dictionary = Dictionary(doc_words)

corpus = [dictionary.doc2bow(words) for words in doc_words]
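Each entry of corpus produced by doc2bow is a list of (token_id, count) pairs. The encoding can be sketched in plain Python (toy token lists, no gensim required; the actual ids gensim assigns may differ):

```python
# Sketch of the bag-of-words encoding that Dictionary / doc2bow performs.
# Token lists are invented for illustration.
doc_words = [['改革', '发展', '改革'], ['发展', '创新']]

# Assign an integer id to each unique token, in first-seen order
token2id = {}
for words in doc_words:
    for w in words:
        token2id.setdefault(w, len(token2id))

# Encode each document as sorted (token_id, count) pairs
corpus = [
    sorted((token2id[w], words.count(w)) for w in set(words))
    for words in doc_words
]
print(token2id)  # {'改革': 0, '发展': 1, '创新': 2}
print(corpus)    # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```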

Since the three documents in this example all come from the same Government Work Report, the number of topics should not be set too high; here we use 2 topics.

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=60)

6 Print the Keywords of the Two Topics

for topic_id, topic_terms in lda.print_topics(num_topics=2, num_words=10):
    print(topic_id, ':', sep='')
    # Each term looks like 0.020*"word"; split off the weight and strip the quotes
    for term in topic_terms.split(' + '):
        weight, word = term.split('*')
        print('  ', word.strip('"'), '(', weight, ')', sep='')
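print_topics returns each topic as a string of weighted terms; the parsing logic above can be verified on a hand-written sample (the weights and words below are invented):

```python
# Parse one topic string in the format gensim's print_topics returns.
# The sample string is made up for illustration.
topic_terms = '0.030*"改革" + 0.025*"发展" + 0.020*"创新"'

pairs = []
for term in topic_terms.split(' + '):
    weight, word = term.split('*')
    pairs.append((word.strip('"'), float(weight)))

print(pairs)  # [('改革', 0.03), ('发展', 0.025), ('创新', 0.02)]
```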


7 Visualize the Topic Keywords

graph=pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.show(graph)




To get the case data and source code, follow the WeChat official account and reply: Python_dt34

