Hot Topic Discovery Research of Stack Overflow Programming Website Based on CBOW-LDA Topic Model (cited by: 4)
Abstract Stack Overflow is a popular programming question-and-answer (Q&A) website. Mining the semantics of the question text in its posts reveals the programming topics that developers care about. The short texts under study are high-dimensional and unevenly distributed, which tends to blur the extracted topics. This paper proposes CBOW-LDA, a modeling method built on the LDA (Latent Dirichlet Allocation) topic model: it first clusters similar words in the target corpus and then performs topic modeling, which reduces the dimensionality of the text input and makes the topic distribution clearer. Experiments were run on POST, a dataset of question posts collected from Stack Overflow for 2010-2015. With the number of topics held equal, perplexity, a standard measure of model performance in text modeling, was used to assess the algorithm across different dataset sizes. Compared with TF-LDA, an existing topic-modeling method that quantifies words by term-frequency weights, CBOW-LDA yields lower perplexity, about 4.87% lower on the experimental corpus, which indicates better performance. CBOW-LDA was then applied to hot-topic mining on Stack Overflow, with TF-LDA as the comparison method. A manually annotated standard evaluation set was built to score the hot topics and hot search terms produced by the two methods by recall, precision, and F1; CBOW-LDA scored better than TF-LDA on every measure. The results show that Java is the hottest topic in the site's question posts over these six years, and that C and JavaScript are the words mentioned most frequently in users' questions.
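The abstract names three ingredients: merging similar words before topic modeling, perplexity, and recall/precision/F1 against a hand-labeled set. The following is a minimal Python sketch of those ideas, not the authors' implementation. Several things here are assumptions for illustration only: the greedy threshold clustering, the `threshold` value, the function names, and the toy two-dimensional vectors, which stand in for CBOW embeddings that would normally be trained on the corpus. The LDA step itself is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_words(vectors, threshold=0.9):
    """Greedy clustering (an assumed stand-in for the paper's clustering step).

    Each word is mapped to the first existing representative whose vector is
    at least `threshold` similar; otherwise it becomes a new representative.
    """
    representatives = []
    mapping = {}
    for word in sorted(vectors):
        for rep in representatives:
            if cosine(vectors[word], vectors[rep]) >= threshold:
                mapping[word] = rep
                break
        else:
            representatives.append(word)
            mapping[word] = word
    return mapping


def reduce_docs(docs, mapping):
    """Replace every token with its cluster representative.

    The reduced documents, with a smaller vocabulary, would then be the
    input to an LDA implementation.
    """
    return [[mapping.get(tok, tok) for tok in doc] for doc in docs]


def perplexity(log_likelihood, n_tokens):
    """Perplexity = exp(-log-likelihood / number of tokens); lower is better."""
    return math.exp(-log_likelihood / n_tokens)


def precision_recall_f1(predicted, gold):
    """Score a predicted set of hot topics or words against a labeled set."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical hand-made vectors; real ones would come from a CBOW model.
toy_vectors = {
    "java": [1.0, 0.1],
    "jvm": [0.9, 0.15],
    "css": [0.1, 1.0],
    "html": [0.2, 0.95],
}
toy_docs = [["java", "jvm", "css"], ["html", "css", "java"]]

mapping = cluster_words(toy_vectors)
reduced = reduce_docs(toy_docs, mapping)
```

In this toy run the four-word vocabulary collapses to two representatives (`java` and `css`), which is the kind of input-dimension reduction the abstract describes. How the paper actually forms its clusters and computes its reported 4.87% figure is not stated in the abstract.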
Authors ZHANG Jing (张景); ZHU Guo-bin (朱国宾) (Department of International Software, Wuhan University, Wuhan 430079, China)
Source Computer Science (《计算机科学》; indexed in CSCD and the Peking University Core Journals list), 2018, No. 4, pp. 208-214 (7 pages)
Funding Supported by the National Key Technology R&D Program of China (2012BAH01F02)
Keywords Stack Overflow; LDA-CBOW language model; Topic detection; Hot topic; Perplexity
