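Abstract
[Objective] For the small volume of journal literature metadata added each day, this paper designs a hash-based multi-pattern matching algorithm that normalizes the institution information in that metadata against a large-scale pattern set. [Methods] Hashing is used to locate pattern strings, which reduces system memory usage. The first word or character of each pattern string is extracted and combined with word-level skip matching, which reduces the number of match attempts and increases the jump distance, thereby improving the efficiency of multi-pattern matching. [Results] In experiments using the 1.82 million records of the CSCD institution database as the pattern set, the algorithm built the dictionary for the pattern set relatively quickly compared with the Aho-Corasick (AC) algorithm. When the text string set contained about 10,000 entries, it showed better time performance, with a 9.39% improvement on the English corpus in particular. Compared with the Wu-Manber (WM) algorithm, it is not constrained by the shortest pattern string. [Limitations] The algorithm or the data must be adjusted for different pattern sets and text string sets. Neither the algorithm nor its extended mode without first words is suitable for scenarios with a small pattern set and a large text string set. [Conclusions] The algorithm can be applied to Chinese, English, and mixed Chinese-English text. With a large pattern set (on the order of 10^6) and a small text string set (about 10,000 entries), it outperforms the classic AC algorithm (by 0.08%-30.41%) and the WM algorithm in time performance.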
[Objective] This paper designs a multiple pattern matching algorithm to standardize the institutional information of sci-tech literature metadata. [Methods] First, we used a hash function to locate the pattern strings, which reduced system memory usage. Then, we extracted the first words of the pattern strings and combined them with word skipping matching. The new algorithm reduced the number of matches and increased the jump range, which improved the efficiency of multiple pattern matching. [Results] We examined our method with the CSCD's institutional library as the pattern string set. Compared with the Aho-Corasick (AC) algorithm, our method quickly constructed the dictionary corresponding to the pattern string set. When the data volume reached about 10,000, our method spent less time on the same tasks. For the English corpus, there was a 9.39% improvement in time performance. Compared with the Wu-Manber (WM) algorithm, our method was not restricted by the shortest pattern string. [Limitations] The algorithm or data needs to be adjusted for different pattern strings and text strings. This algorithm and the extended headless mode are not suitable for small pattern string sets with large text string sets. [Conclusions] The algorithm can be applied to Chinese, English, and mixed Chinese-English texts. The time performance of our algorithm is superior to the AC and WM algorithms when processing a large pattern string set (10^6) and a small text string set (about 10,000).
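The idea described in the abstract, locating candidate patterns by hashing their first word and then comparing and jumping word by word, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `tokenize`, `build_index`, and `find_matches`, the tokenization rule (English words and single CJK characters), and the longest-match-first policy are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumption: a token is an English word/number or a single CJK character.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")


def tokenize(text):
    """Split text into lowercase English words and single CJK characters."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def build_index(patterns):
    """Hash each pattern under its first token (a dict serves as the hash table)."""
    index = defaultdict(list)
    for pattern in patterns:
        tokens = tuple(tokenize(pattern))
        if tokens:
            index[tokens[0]].append((tokens, pattern))
    # Try longer candidates first so the longest pattern wins at a position.
    for candidates in index.values():
        candidates.sort(key=lambda c: len(c[0]), reverse=True)
    return dict(index)


def find_matches(text, index):
    """Scan token by token; after a hit, jump past the whole matched pattern."""
    tokens = tokenize(text)
    hits = []
    i = 0
    while i < len(tokens):
        for pat_tokens, original in index.get(tokens[i], ()):
            if tuple(tokens[i:i + len(pat_tokens)]) == pat_tokens:
                hits.append((i, original))
                i += len(pat_tokens)  # skip the matched words in one jump
                break
        else:
            i += 1  # no pattern starts with this word: advance one word
    return hits


if __name__ == "__main__":
    idx = build_index([
        "University of Southampton",
        "University of Chinese Academy of Sciences",
        "中国科学院",
    ])
    print(find_matches("University of Chinese Academy of Sciences, 中国科学院", idx))
```

Because candidates are looked up only by the first word of the current position, most positions in the text cost a single hash lookup, and no minimum pattern length is needed in this sketch.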
Authors
董美
常志军
张润杰
Dong Mei; Chang Zhijun; Zhang Runjie (National Science Library, Chinese Academy of Sciences, Beijing 100190, China; Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China; Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK)
Source
《数据分析与知识发现》
CSSCI
CSCD
Peking University Core Journals
2021, Issue 6, pp. 135-144 (10 pages)
Data Analysis and Knowledge Discovery
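Funding
This work is one of the research outcomes of the Literature and Information Capacity Building Project of the Chinese Academy of Sciences (Grant No. Y9100901).
Keywords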
模式匹配
数据规范化
名称规范
哈希算法
Pattern Match
Data Standardization
Name Authority
Hash Algorithm