
An approach for text similarity measurement of science and technology projects combining TF-IDF and Simhash (一种结合TF-IDF和Simhash的科技项目文本相似性度量方法)
Cited by: 5
Abstract: To improve the accuracy and performance of text similarity measurement for science and technology projects, this paper combines TF-IDF and Simhash into a new measurement approach. First, the method preprocesses the project texts with natural language processing techniques to obtain a term set, uses TF-IDF to compute a weight for each term in the set, and selects the important terms with the higher weights. Second, it uses the Simhash algorithm to map the selected important terms to fixed-length binary strings and sums them to obtain the Simhash signature of the text. Finally, it uses the Hamming distance to compute the similarity between two Simhash signatures. Experimental results show that the proposed method outperforms both the traditional Simhash algorithm and the TF-IDF method in precision, recall, and F-measure.
Authors: Sun Beining (孙北宁); Lv Weixin (吕维新); Zeng Jun (曾俊); Xiao Heng (肖衡) (Department of Science Technology and Data, Yunnan Power Grid Co., Ltd., Kunming 650011, China; School of Big Data and Intelligent Engineering, Southwest Forestry University, Kunming 650224, China; Kunming Power Supply Bureau, Yunnan Power Grid Co., Ltd., Kunming 650011, China; Yunnan Yundian Tongfang Technology Co., Ltd., Kunming 650214, China)
Source: 《电子技术应用》 (Application of Electronic Technique), 2023, No. 6, pp. 89-93 (5 pages)
Funding: National Natural Science Foundation of China (61702442).
Keywords: science and technology project text; text similarity; TF-IDF; Simhash
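
The three steps summarized in the abstract (TF-IDF term selection, Simhash signature, Hamming distance) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions that the abstract does not specify: whitespace tokenization stands in for the paper's preprocessing, the IDF is smoothed, terms are hashed with MD5 truncated to 64 bits, each term's bit vote is weighted by its TF-IDF value, and the helper names (`tfidf_weights`, `top_terms`, `simhash`, `hamming`, `similarity`) and the cutoff `k` are hypothetical.

```python
import hashlib
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: tf-idf} dict per tokenized document.

    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1, so weights stay positive.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        })
    return weights


def top_terms(weights, k):
    """Keep the k terms with the highest weight (ties broken alphabetically)."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:k])


def term_hash(term, bits=64):
    """Map a term to a fixed-length bit string (assumption: truncated MD5)."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def simhash(weights, bits=64):
    """Sum weighted +/- votes per bit position, then keep the sign of each sum."""
    acc = [0.0] * bits
    for term, w in weights.items():
        h = term_hash(term, bits)
        for i in range(bits):
            acc[i] += w if (h >> i) & 1 else -w
    sig = 0
    for i, v in enumerate(acc):
        if v > 0:
            sig |= 1 << i
    return sig


def hamming(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def similarity(a, b, bits=64):
    """Turn a Hamming distance into a score in [0, 1]."""
    return 1.0 - hamming(a, b) / bits


# Toy corpus; a real pipeline would segment Chinese text and remove stop words.
docs = [
    "smart grid fault detection based on deep learning".split(),
    "smart grid fault detection using deep learning models".split(),
    "forest fire monitoring with satellite images".split(),
]
weights = tfidf_weights(docs)
sigs = [simhash(top_terms(w, k=5)) for w in weights]
score = similarity(sigs[0], sigs[1])
```

Restricting the signature to the top-`k` terms is what distinguishes this sketch from plain Simhash over all terms. A suitable `k` and a Hamming-distance threshold for deciding that two texts are similar would have to be tuned on real data.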
