Abstract
With the arrival of the big data era, effectively mining and analyzing topic features from massive text data has become a research focus. Latent Dirichlet Allocation (LDA), a classical probabilistic topic model, is widely used because of its strong text analysis capability. However, such models mostly take the form of directed graphs containing latent topic variables, which limits how they can represent documents. A distributed representation, in contrast, defines the semantics of a document as distributed over multiple topics and obtained as the product of multiple topic features. Moreover, traditional unsupervised feature extraction models cannot effectively handle document data that carries category labels. Therefore, building on a study of the restricted Boltzmann machine (RBM) and combining it with the distributed nature of text topics, this paper proposes NRBM, an RBM-based distributed topic feature extraction model. As a semi-supervised model, NRBM can make effective use of the multi-label information in documents. Comparative experiments against the standard LDA topic model demonstrate the superiority of NRBM.
Source
《计算机工程与应用》
CSCD
Peking University Core Journal (北大核心)
2017, No. 23, pp. 108-112 (5 pages)
Computer Engineering and Applications
Funding
National Natural Science Foundation of China (No. 71172219)
Anhui Provincial Natural Science Research Project, Provincial Key Project (No. KJ2011Z039)