Distributed theme feature extraction based on restricted Boltzmann machine (original title: 基于受限玻尔兹曼机的分布式主题特征提取)

Cited by: 5
Abstract: With the arrival of the big data era, effectively mining and analyzing topic features from massive text data has become a research focus. Latent Dirichlet allocation (LDA), a classical probabilistic topic model, is widely used because of its strong text analysis capability. However, such models mostly take the form of directed graphs containing latent topic variables, which limits how they can represent documents. Distributed representation, by contrast, defines the semantics of a document as spread across multiple topics and obtained by multiplying the features of those topics. Moreover, traditional unsupervised feature extraction models cannot effectively handle document data that carries category labels. Building on the restricted Boltzmann machine (RBM) and the distributed nature of text topics, this paper therefore proposes NRBM, an RBM-based distributed topic feature extraction model that, as a semi-supervised model, can make effective use of the multi-label information in documents. Comparative experiments against the standard LDA topic model demonstrate the superiority of NRBM.
Authors: 江雨燕 (Jiang Yuyan), 桂伟 (Gui Wei)
Source: Computer Engineering and Applications (《计算机工程与应用》), indexed in CSCD and the Peking University Core Journals list, 2017, No. 23, pp. 108-112 (5 pages)
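Funding: National Natural Science Foundation of China (No. 71172219); Anhui Provincial Natural Science Research Project, Provincial Key Project (No. KJ2011Z039)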
Keywords: text data; probabilistic topic model; latent Dirichlet allocation; restricted Boltzmann machine