
A Multiple Pattern Matching Algorithm for Specifications of Incremental Metadata for Sci-Tech Literature

Original Chinese title: 一种面向科技文献元数据增量数据规范的多模式匹配算法 (cited by: 1)
Abstract [Objective] For the small batches of journal literature metadata that are added daily, this paper designs a hash-based multiple pattern matching algorithm that standardizes institutional information against a large pattern set. [Methods] The algorithm uses hashing to locate pattern strings, which reduces memory usage. It extracts the first word (or first character) of each pattern string and combines it with word-level skip matching, which reduces the number of comparisons and lengthens the jumps, improving the efficiency of multiple pattern matching. [Results] In experiments that used 1.82 million records from the CSCD institution library as the pattern set, the algorithm built the dictionary for the pattern set relatively quickly compared with the Aho-Corasick (AC) algorithm. When the text set contained about 10,000 records, it also achieved better time performance, with a 9.39% improvement on the English corpus. Unlike the Wu-Manber (WM) algorithm, it is not restricted by the shortest pattern string. [Limitations] The algorithm or the data must be adjusted for different pattern sets and text sets. Neither the algorithm nor its extended headless (no-first-word) mode is suitable when the pattern set is small and the text set is large. [Conclusions] The algorithm can be applied to Chinese, English, and mixed Chinese-English text. With a large pattern set (on the order of 10^6) and a small text set (about 10,000), it outperforms the classic AC algorithm (by 0.08%-30.41%) and the WM algorithm in time performance.
Authors Dong Mei (董美); Chang Zhijun (常志军); Zhang Runjie (张润杰) (National Science Library, Chinese Academy of Sciences, Beijing 100190, China; Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China; Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK)
Source Data Analysis and Knowledge Discovery (《数据分析与知识发现》), indexed in CSSCI, CSCD, and the Peking University Core Journals list, 2021, No. 6, pp. 135-144 (10 pages)
Funding One of the research outcomes of the Chinese Academy of Sciences Literature and Information Capacity Building Project (Grant No. Y9100901).
Keywords Pattern Match; Data Standardization; Name Authority; Hash Algorithm
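
The abstract describes the approach only in outline, so the following is a minimal Python sketch of the general idea rather than the authors' implementation: pattern strings are bucketed in a hash table keyed by their first word (or first character for Chinese), and the text is scanned token by token so that a token that starts no pattern is dismissed with a single lookup. The tokenization rule, the longest-match-first policy, and all names (`tokenize`, `build_index`, `match`) are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumption: a token is a run of ASCII letters/digits, or any other single
# non-space character (so every Chinese character is its own token).
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]")


def tokenize(text):
    """Split text into lower-cased word/character tokens."""
    return [m.group(0).lower() for m in TOKEN_RE.finditer(text)]


def build_index(patterns):
    """Hash each pattern under its first token: first token -> [(tokens, pattern)]."""
    index = defaultdict(list)
    for pattern in patterns:
        tokens = tuple(tokenize(pattern))
        if tokens:
            index[tokens[0]].append((tokens, pattern))
    # Hypothetical policy: try longer patterns first so the longest one wins.
    for bucket in index.values():
        bucket.sort(key=lambda entry: len(entry[0]), reverse=True)
    return dict(index)


def match(text, index):
    """Return (token position, pattern) for each pattern found in text."""
    tokens = tokenize(text)
    hits = []
    i = 0
    while i < len(tokens):
        step = 1  # default: move on by one token
        for pat_tokens, pattern in index.get(tokens[i], ()):
            if tuple(tokens[i:i + len(pat_tokens)]) == pat_tokens:
                hits.append((i, pattern))
                step = len(pat_tokens)  # jump past the whole matched pattern
                break
        i += step
    return hits


if __name__ == "__main__":
    idx = build_index(["Chinese Academy of Sciences", "中国科学院", "中国科学院大学"])
    print(match("Dong Mei, Chinese Academy of Sciences; 中国科学院大学", idx))
```

Because lookup is keyed on whole first tokens rather than on a shift table sized by the shortest pattern, a sketch like this has no minimum pattern length, which is in the spirit of the contrast the abstract draws with WM. The trade-off is that matches can only begin at token boundaries.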