Abstract
Text data is the most natural way to store and exchange information, and text mining can uncover knowledge patterns hidden in massive text collections. This paper studies topic mining and association search for text data. First, text information is processed through text parsing and extraction, word segmentation preprocessing, and indexing. Then, a topic discovery model based on latent semantic relations is used to mine the topic information hidden in large volumes of text. Finally, the topic model is used to compute the degree of association between keywords for query expansion, which enables association search. A prototype system for text data mining and association search was implemented. Topic discovery and association search were performed on the Tancorp dataset, and the association search process is displayed synchronously through visualization and Web pages.
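The last step of the pipeline, using a topic model to score keyword association and expand a query, can be illustrated with a minimal sketch. The abstract does not specify the model or the association measure, so everything here is an assumption: the `TOPIC_WORD` table is a hypothetical stand-in for the output of a trained topic model, cosine similarity over per-topic weights is one plausible association measure, and the function names are illustrative only.

```python
import math

# Hypothetical topic-word weights, standing in for the output of a trained
# topic model. The values are illustrative and not taken from the paper.
TOPIC_WORD = {
    "sports": {"football": 0.5, "match": 0.3, "coach": 0.2},
    "finance": {"stock": 0.5, "market": 0.4, "match": 0.1},
}


def topic_vector(word):
    """Return the word's weight under each topic, in a fixed topic order."""
    return [TOPIC_WORD[topic].get(word, 0.0) for topic in sorted(TOPIC_WORD)]


def association(word_a, word_b):
    """Cosine similarity of two topic vectors (assumed association measure)."""
    vec_a, vec_b = topic_vector(word_a), topic_vector(word_b)
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm = math.sqrt(sum(a * a for a in vec_a)) * math.sqrt(sum(b * b for b in vec_b))
    return dot / norm if norm else 0.0


def expand_query(keyword, top_k=2):
    """Return the keyword followed by up to top_k associated terms."""
    vocabulary = {w for words in TOPIC_WORD.values() for w in words}
    scored = [(association(keyword, w), w) for w in vocabulary if w != keyword]
    # Highest association first; ties are broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [keyword] + [w for score, w in scored[:top_k] if score > 0]


if __name__ == "__main__":
    print(expand_query("football"))
```

In this sketch a keyword that shares no topic with the query scores zero and is never added, so the expanded query stays within the topics the original keyword belongs to.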
Source
《计算机科学》 (Computer Science)
CSCD
北大核心 (Peking University Core Journal)
2017, Issue B11, pp. 411-413, 456 (4 pages in total)
Keywords
Text mining
Topic discovery
Association search