Abstract
[Objective] To address the shortcomings of computing the similarity of scientific and technical documents from text alone, this paper proposes a method that combines text and formula information. [Methods] First, the feature elements of each formula are mapped to a position vector, from which the similarity between individual formulas is computed. Second, the formula coverage and formula similarity between documents are computed. Finally, text and formula information are combined to obtain the similarity of scientific and technical documents. [Results] A comparison of classification performance between the proposed method and the traditional vector space method shows that the proposed method improves the macro-averaged F-score by up to 6.7%. [Limitations] No public test set containing the formula information of documents is available, and the self-built dataset is small and needs to be expanded. [Conclusions] Incorporating formula information not only improves the accuracy of document similarity computation but also makes it possible to compute similarity between documents written in different languages.
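The three steps named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual formulas, so everything concrete here is an assumption made for illustration: the position weighting `1 / (1 + index)`, the matching `threshold`, the mixing weight `alpha`, and all function names are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def position_vector(tokens):
    """Map formula symbols to position-weighted values.

    Assumption: earlier symbols weigh more, via 1 / (1 + index).
    """
    vec = defaultdict(float)
    for index, token in enumerate(tokens):
        vec[token] += 1.0 / (1 + index)
    return vec


def formula_similarity(f1, f2):
    """Similarity of two single formulas given as token lists."""
    return cosine(position_vector(f1), position_vector(f2))


def formula_coverage_and_similarity(formulas_a, formulas_b, threshold=0.8):
    """Coverage and mean best-match similarity of A's formulas against B's.

    Assumption: a formula in A counts as covered when its best match in B
    reaches `threshold`. The measure is not symmetric in A and B.
    """
    if not formulas_a or not formulas_b:
        return 0.0, 0.0
    best = [max(formula_similarity(f, g) for g in formulas_b) for f in formulas_a]
    coverage = sum(1 for s in best if s >= threshold) / len(best)
    similarity = sum(best) / len(best)
    return coverage, similarity


def document_similarity(text_a, text_b, formulas_a, formulas_b, alpha=0.5):
    """Assumed linear mix of text similarity and formula evidence."""
    text_sim = cosine(Counter(text_a.split()), Counter(text_b.split()))
    coverage, formula_sim = formula_coverage_and_similarity(formulas_a, formulas_b)
    return alpha * text_sim + (1 - alpha) * coverage * formula_sim
```

In this sketch, two documents that share no words but contain the same formulas still receive a non-zero score from the formula term, which is one way to picture the cross-language use mentioned in the conclusions.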
Authors
Xu Jianmin (徐建民)
Xu Caiyun (许彩云)
School of Cyber Security and Computer, Hebei University, Baoding 071002, China
Source
《数据分析与知识发现》
CSSCI
CSCD
Peking University Core Journal (北大核心)
2018, Issue 10, pp. 103-109 (7 pages)
Data Analysis and Knowledge Discovery
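Funding
This work is one of the research outputs of the following projects:
Natural Science Foundation of Hebei Province project "Research on Topic Detection and Tracking Methods Based on Bayesian Networks" (Grant No. 2015201142)
National Social Science Fund of China post-funding project "Extension of the Bayesian Network Retrieval Model Based on Term Relationships" (Grant No. 17FTQ002)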
Keywords
Formula Similarity (公式相似度)
Document Similarity (文档相似度)
Coverage Degree (覆盖度)
Scientific and Technical Documents (科技文档)