
General Text Matching Based on Topic Model (基于主题模型的通用文本匹配方法)
Abstract: Measuring the similarity between a long text and a short text arises in a growing number of application scenarios, and consistency checks on such text pairs can mostly be abstracted as a text-similarity comparison problem. The difficulty is that short texts are sparse, which makes it hard to determine which domain a short text belongs to and what background knowledge applies, and also hard to introduce word embeddings to solve specific text-matching problems in general scenarios. To address this problem, the paper proposes a new lightweight approach based on a topic model with text clustering, which matches general long and short texts without using additional background knowledge. Experimental results on two typical test sample datasets show that the proposed method detects text similarity with very high efficiency.
Authors: Huang Zhenye (黄振业); Mo Ganqing (莫淦清); Yu Keman (余可曼) (School of Information Technology, Zhejiang Financial College, Hangzhou 310018, Zhejiang, China; Hangzhou Pingzhi Information Technology Co., Ltd., Hangzhou 310030, Zhejiang, China)
Source: Computer Applications and Software (《计算机应用与软件》), Peking University Core Journal, 2024, No. 5, pp. 310-318, 349 (10 pages in total)
Keywords: Natural language processing; Text matching; Topic model; Gibbs sampling