
Material Classification-Oriented Chinese String Similarity Metric (面向物资分类的中文字符串相似度计算方法)
Cited by: 1
Abstract: Material classification is a basic task in enterprise material management. Large enterprises hold huge numbers of materials in many categories, which makes manual classification impractical, so automatic classification techniques are needed to improve efficiency. In automatic classification, the similarity between material names is one of the key factors affecting the result; however, traditional string similarity metrics are not well suited to Chinese material names. After analyzing the characteristics of material name strings and the Jaro-Winkler algorithm, this paper proposes a Chinese string similarity metric based on dynamic weighting, which dynamically estimates the weights of the suffixes in Chinese material names. Experiments on a real material classification dataset show that the dynamic-weighting metric outperforms traditional metrics and effectively improves classification accuracy.
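For orientation, the baseline the abstract refers to can be sketched as follows. This is a minimal Python sketch of the standard Jaro-Winkler similarity, not the paper's dynamic-weighting method, whose details are not given in this record. The function names `jaro` and `jaro_winkler`, the parameters `prefix_scale` and `max_prefix` (set to the commonly used 0.1 and 4), and the sample material names are assumptions made for illustration.

```python
def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]; works on any Unicode string."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str,
                 prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the shared prefix."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * prefix_scale * (1.0 - sim)


# Classic textbook pair: the score is about 0.961.
print(round(jaro_winkler("MARTHA", "MARHTA"), 3))

# Hypothetical material names (stainless-steel bolt vs. stainless-steel nut):
# they share a four-character prefix and differ only in the last character,
# so the score is about 0.92.
print(round(jaro_winkler("不锈钢螺栓", "不锈钢螺母"), 3))
```

The second pair hints at why a prefix-weighted metric may fit Chinese material names poorly: the two names score as near duplicates even though the final character is the only part that tells the items apart. That reading is consistent with the abstract's emphasis on suffix weights, but how the paper actually estimates those weights is not stated here.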
Authors: Han Jianguo (韩建国), Gong Jun (巩军)
Affiliation: Shenhua Group (神华集团)
Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), CSSCI, Peking University Core Journal, 2012, No. 7, pp. 709-714 (6 pages)
Keywords: string similarity, automatic classification, material classification

References (15)

  • 1. Yang Yiming, Liu Xin. A re-examination of text categorization methods [C]//Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval, 1999: 42-49.
  • 2. Cohen W W, Ravikumar P, Fienberg S. A Comparison of String Metrics for Matching Names and Records [C]//Proceedings of KDD Workshop on Data Cleaning and Object Consolidation, 2003: 73-78.
  • 3. Jaro M A. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida [J]. Journal of the American Statistical Association, 84: 414-420.
  • 4. Jaro M A. Probabilistic linkage of large public health data files (disc: P687-689) [J]. Statistics in Medicine, 14: 491-498.
  • 5. Winkler W E. The state of record linkage and current research problems, Statistics of Income Division, Internal Revenue Service Publication R99/04 [EB/OL]. [2012-01-02]. http://www.census.gov/srd/www/byname.html.
  • 6. Monge A, Elkan C. The field-matching problem: algorithm and applications [C]//Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996.
  • 7. Monge A, Elkan C. An efficient domain-independent algorithm for detecting approximately duplicate database records [C]//Proceedings of the SIGMOD 1997 Workshop on Data Mining and Knowledge Discovery, 1997.
  • 8. Piskorski J, Wieloch K, Pikula M, et al. Towards Person Name Matching for Inflective Languages [C]//WWW 2008 Workshop NLP Challenges in the Information Explosion Era, 2008.
  • 9. Arehart M D, Miller K J. A Ground Truth Dataset for Matching Culturally Diverse Romanized Person Names [C]//Language Resources and Evaluation Conference, Marrakesh, Morocco, 2008.
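  • 10. Zhang Xiaoluan, Wang Xifeng. Research and implementation of Chinese semantic computing based on conceptual graphs [J]. Computer Engineering and Applications, 2011, 47(10): 120-123. Cited by: 10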

