Abstract
To improve text clustering performance, the k-modes algorithm is applied to text clustering, with a knowledge graph used to pre-analyze the samples so that k-modes is better suited to text data. In the preprocessing step, each text to be clustered is analyzed into knowledge graph triples, producing sample sets of the corresponding concepts, entities, and relations. A k-modes text clustering model is then built whose objective function is the sum of the distances from the nodes in each cluster to that cluster's center. The membership matrix and the cluster-center matrix are fixed in turn, and the objective function is minimized repeatedly until its value stabilizes, which yields the cluster centers. The final clustering is determined from the cluster centers and each node's distance to them. Experiments show that, after knowledge graph analysis, k-modes achieves better purity, normalized mutual information (NMI), and F-measure, and a lower root mean squared error (RMSE) of clustering purity. Compared with commonly used text clustering algorithms, the proposed algorithm achieves higher clustering accuracy on both the UCI dataset and the news dataset.
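The alternating procedure described in the abstract can be illustrated with a minimal k-modes sketch. This is not the authors' implementation. It assumes each text has already been reduced to a fixed-length tuple of categorical values (here a hypothetical concept, entity, relation triple), that distance is the simple matching (Hamming) dissimilarity, and that initial centers are supplied by the caller. The names `hamming` and `k_modes` and the sample data are illustrative only. The loop stops when the memberships no longer change, which also means the objective value has stabilized.

```python
from collections import Counter


def hamming(a, b):
    """Simple matching dissimilarity: count of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))


def k_modes(samples, init_modes, max_iter=100):
    """Alternate between updating memberships and updating cluster modes.

    samples    : list of equal-length tuples of categorical values
    init_modes : list of initial cluster centers (assumed given by the caller)
    """
    modes = [tuple(m) for m in init_modes]
    labels = None
    for _ in range(max_iter):
        # Step 1: fix the modes, assign every sample to its nearest mode.
        new_labels = [
            min(range(len(modes)), key=lambda j: hamming(s, modes[j]))
            for s in samples
        ]
        if new_labels == labels:
            break  # memberships unchanged, so the objective is stable
        labels = new_labels
        # Step 2: fix the memberships, set each mode to the most frequent
        # value of every attribute among the cluster's members.
        for j in range(len(modes)):
            members = [s for s, lab in zip(samples, labels) if lab == j]
            if members:  # keep the old mode if a cluster becomes empty
                modes[j] = tuple(
                    Counter(col).most_common(1)[0][0] for col in zip(*members)
                )
    # Objective: sum of distances from each sample to its cluster mode.
    cost = sum(hamming(s, modes[lab]) for s, lab in zip(samples, labels))
    return labels, modes, cost


# Hypothetical (concept, entity, relation) samples for illustration.
samples = [
    ("sports", "team", "plays"),
    ("sports", "team", "wins"),
    ("sports", "player", "plays"),
    ("finance", "bank", "lends"),
    ("finance", "bank", "invests"),
    ("finance", "fund", "lends"),
]
labels, modes, cost = k_modes(samples, init_modes=[samples[0], samples[3]])
print(labels, modes, cost)
```

In practice the initial modes are usually chosen at random or by a seeding heuristic, and the result depends on that choice. The fixed initialization here only keeps the example deterministic.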
Authors
高静
王钢
Gao Jing; Wang Gang (School of Information and Mechatronics Engineering, Zhengzhou Business University, Gongyi 451200, China)
Source
《南京理工大学学报》
CAS
CSCD
Peking University Core Journals (北大核心)
2022, No. 1, pp. 76-82 (7 pages)
Journal of Nanjing University of Science and Technology
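Funding
National Natural Science Foundation of China (61961010)
Key Scientific Research Project of Higher Education Institutions, Henan Provincial Department of Education (20B120003)
Henan Provincial Department of Education Project (2020YB0403)
Zhengzhou Business University New Engineering Innovation and Integration Team Project (2021-CXTD-05)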