Python Implementation of K-means Clustering for Chinese Literature (cited by: 6)

Chinese Literature Clustering Research Based on Python K-means Algorithm
Abstract: Clustering is an important means of organizing, summarizing, and navigating text information. K-means is a typical distance-based clustering algorithm; applied to Chinese literature, it groups a set of documents into several classes by content similarity and helps uncover the implicit knowledge they contain. Through a worked example, this paper summarizes the process of clustering Chinese literature with the K-means algorithm in Python. Three evaluation indexes, the CH (Calinski-Harabasz) index, the silhouette coefficient, and the SSE (sum of squared errors), are used to select the initial number of clusters for K-means, that is, the range of the optimal k value. The documents are then clustered separately by keywords and by abstracts, and the two sets of results are compared. The comparison leads to the conclusion that clustering Chinese literature by abstracts gives better results; documents within the same class can then be clustered by keywords to further mine the implicit knowledge.
Author: ZHAO Qian-yi (赵谦益), School of Information, Guizhou University of Finance and Economics, Guiyang, China
Source: Software (《软件》), 2019, No. 8, pp. 89-94 (6 pages)
Keywords: K-means algorithm; literature clustering; evaluation index
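
The k-selection step described in the abstract can be illustrated with a minimal sketch. It is not the paper's code: the abstract does not say how the documents are vectorized, so the synthetic 2-D points from the hypothetical helper `make_blobs` stand in for document feature vectors, and the plain-Python `kmeans`, `sse`, `silhouette`, and `ch_index` functions are illustrative names written for this example. In practice, scikit-learn's `KMeans` (its `inertia_` attribute is the SSE), `silhouette_score`, and `calinski_harabasz_score` cover the same ground.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm; initial centers are k points sampled from the data."""
    centers = random.Random(seed).sample(points, k)
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = mean(members)
    return labels, centers


def sse(points, labels, centers):
    """Sum of squared distances from each point to its cluster center."""
    return sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))


def ch_index(points, labels, centers):
    """Calinski-Harabasz index: between-cluster over within-cluster dispersion."""
    n, k = len(points), len(centers)
    overall = mean(points)
    between = sum(labels.count(c) * sq_dist(centers[c], overall)
                  for c in range(k))
    return (between / (k - 1)) / (sse(points, labels, centers) / (n - k))


def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    k = max(labels) + 1
    clusters = [[p for p, l in zip(points, labels) if l == c] for c in range(k)]
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for c, other in enumerate(clusters) if c != l and other)
        total += (b - a) / max(a, b)
    return total / len(points)


def make_blobs(seed=42):
    """Hypothetical stand-in data: three Gaussian blobs of 30 points each."""
    rng = random.Random(seed)
    blob_centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
    return [[cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5)]
            for cx, cy in blob_centers for _ in range(30)]


points = make_blobs()
scores = {}
for k in range(2, 7):
    # Several restarts per k; keep the run with the lowest SSE.
    runs = [kmeans(points, k, seed=s) for s in range(5)]
    labels, centers = min(runs, key=lambda r: sse(points, *r))
    scores[k] = (sse(points, labels, centers),
                 silhouette(points, labels),
                 ch_index(points, labels, centers))
    print(f"k={k}  SSE={scores[k][0]:.2f}  "
          f"silhouette={scores[k][1]:.3f}  CH={scores[k][2]:.1f}")
```

The three indexes are read together: a larger silhouette coefficient and CH index indicate tighter, better-separated clusters, while SSE generally shrinks as k grows, so the usual heuristic is to look for the "elbow" where its decrease levels off. Because each index can point to a slightly different k, combining them tends to yield a range of candidate values rather than a single number, which matches the abstract's phrasing.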