Abstract
Topic word mining is easily disturbed by noisy data, which leads to high perplexity. To address this, a database topic word mining algorithm that incorporates the LDA topic model is proposed. First, the curvelet threshold is combined with the ICEEMDAN method to denoise the data in the database, so that noise does not affect topic word mining. Next, residual analysis and linear regression are combined to select cluster centers, and the data in the database are divided into multiple categories. The LDA topic model and a graph model are then used to compute word scores within each data category, and the highest-scoring words are selected as the topic words of the database, which completes the mining. Experimental results show that the proposed algorithm achieves high clustering accuracy, low perplexity in its mining results, and high mining efficiency.
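The abstract evaluates the mining results by perplexity but does not state the formula. As a minimal sketch, assuming the standard topic-model definition (the exponential of the negative total log-likelihood divided by the total token count), the metric could be computed as below. The function name `perplexity` and the example values are hypothetical and are not taken from the paper.

```python
import math


def perplexity(doc_log_likelihoods, doc_lengths):
    """Corpus perplexity: exp(-(sum of per-document log-likelihoods) / total tokens).

    Assumes natural-log likelihoods, one value per document, and the
    matching token count for each document. Lower values mean the model
    predicts the held-out text better.
    """
    total_tokens = sum(doc_lengths)
    if total_tokens <= 0:
        raise ValueError("corpus must contain at least one token")
    return math.exp(-sum(doc_log_likelihoods) / total_tokens)


# Hypothetical example: two documents (4 and 6 tokens) in which every token
# has probability 1/8 under the model.
log_liks = [4 * math.log(1 / 8), 6 * math.log(1 / 8)]
lengths = [4, 6]
ppl = perplexity(log_liks, lengths)
```

In this example the model is as uncertain as a uniform choice among 8 words per token, so the perplexity comes out at 8 up to floating-point rounding.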
Authors
PENG Can-hua (彭灿华)
WEI Xiao-min (韦晓敏)
Affiliations: Faculty of Information Engineering, Guilin Institute of Information Technology, Guilin, Guangxi 541004, China; Faculty of Computer Science, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China
Source
Computer Simulation (《计算机仿真》)
Peking University Core Journal list (北大核心)
2023, Issue 8, pp. 483-487 (5 pages)
Funding
University-level research project (XJ202008).
Keywords
LDA topic model
Data noise reduction
Residual analysis method
ICEEMDAN method
Topic word mining