Abstract
With the growth of the Internet, microblogs have become a major platform for obtaining information. To mine valuable topic information from massive volumes of microblog posts, this paper uses three features of microblogs (conversations, retweets, and hashtags) to divide posts into three types: user-interest posts, user-interaction posts, and hashtag-related posts. Building on the Author Topic Model (ATM), it proposes a hashtag topic model, the Hashtag Conversation Author Topic Model (HC-ATM), and derives inference for the model with Gibbs sampling to recover the topic structure of microblogs. Experiments on a Twitter dataset show that, compared with ATM and MicroBlog Latent Dirichlet Allocation (MB-LDA), HC-ATM achieves lower topic perplexity and greater topic divergence (measured by KL divergence), and that it effectively mines the topic distributions of the different microblog types.
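The two evaluation measures named in the abstract, perplexity and KL divergence, can be illustrated with a minimal sketch. It is not the paper's evaluation code: the function names, the input layout (`theta` as per-document topic proportions, `phi` as per-topic word distributions), and the choice to average KL divergence over all ordered topic pairs are assumptions made here for illustration.

```python
import math
from itertools import permutations


def perplexity(docs, theta, phi):
    """Perplexity of a fitted topic model on tokenized documents.

    docs  : list of documents, each a list of integer word ids
    theta : theta[d][k] = P(topic k | document d)   (assumed layout)
    phi   : phi[k][w]   = P(word w | topic k)       (assumed layout)
    """
    log_lik = 0.0
    n_tokens = 0
    for d, words in enumerate(docs):
        for w in words:
            # Marginal word probability: mixture over topics.
            p = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_lik += math.log(p)
            n_tokens += 1
    # Lower perplexity means the model predicts the words better.
    return math.exp(-log_lik / n_tokens)


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mean_pairwise_kl(phi):
    """Average KL divergence over all ordered pairs of topics (an assumed
    aggregation); larger values mean the topics are more distinct."""
    pairs = list(permutations(range(len(phi)), 2))
    return sum(kl_divergence(phi[i], phi[j]) for i, j in pairs) / len(pairs)
```

Under these assumptions, a model whose topics are all identical has a mean pairwise KL divergence of zero, and a model that assigns uniform probability over a vocabulary of V words has perplexity V. Distributions estimated with Dirichlet smoothing are strictly positive, which satisfies the precondition of `kl_divergence`.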
Source
《计算机工程》
CAS
CSCD
Peking University Core Journals (北大核心)
2015, No. 4, pp. 30-35 (6 pages)
Computer Engineering
Funding
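National Natural Science Foundation of China (61033010, 61272065)
Natural Science Foundation of Guangdong Province (S2011020001182, S2012010009311)
Guangdong Provincial Science and Technology Plan Project (2011B040200007, 2012A010701013)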