Automatic Metadata Extraction of Scanned Books Based on Machine Learning (基于机器学习的扫描图书元数据自动抽取研究)
Cited by: 4
Abstract: Metadata entry is a required step when digitizing printed books, but manual entry is labor-intensive and inefficient. To address this problem, the paper proposes a machine-learning-based method for automatically extracting metadata from scanned books. It first defines the descriptive, administrative, and structural metadata elements. Then, taking the DjVuXML documents of the scanned pages as the data source, it analyzes page features such as format and structure, uses each text line as the initial feature vector, and extracts metadata with rule-based and supervised machine learning methods. Experiments show that the algorithm achieves high accuracy and recall and can significantly improve the efficiency of book digitization.
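The line-level feature step mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the usual DjVuXML layout (`OBJECT` pages containing `LINE` and `WORD` elements, each `WORD` carrying a `coords` attribute), the sample page is invented, and the function name `line_features` and the four features are hypothetical choices made only for this example.

```python
import xml.etree.ElementTree as ET

# Invented one-page sample in the assumed DjVuXML layout.
SAMPLE = """<DjVuXML>
<BODY>
<OBJECT height="3000" width="2000">
<HIDDENTEXT>
<PAGECOLUMN><REGION><PARAGRAPH>
<LINE>
<WORD coords="600,500,900,400">Modern</WORD>
<WORD coords="950,500,1400,400">Libraries</WORD>
</LINE>
<LINE>
<WORD coords="800,700,1000,660">John</WORD>
<WORD coords="1020,700,1200,660">Smith</WORD>
</LINE>
</PARAGRAPH></REGION></PAGECOLUMN>
</HIDDENTEXT>
</OBJECT>
</BODY>
</DjVuXML>"""


def line_features(xml_text):
    """Return one feature dict per text line (hypothetical feature set)."""
    root = ET.fromstring(xml_text)
    rows = []
    for page in root.iter("OBJECT"):
        page_w = float(page.get("width"))
        page_h = float(page.get("height"))
        for line in page.iter("LINE"):
            words = [w for w in line.iter("WORD") if w.get("coords")]
            if not words:
                continue
            # Use only the first four numbers of each box; min/max below
            # makes the result independent of the corner ordering.
            boxes = [[float(v) for v in w.get("coords").split(",")[:4]]
                     for w in words]
            xs = [v for b in boxes for v in (b[0], b[2])]
            ys = [v for b in boxes for v in (b[1], b[3])]
            left, right = min(xs), max(xs)
            top, bottom = min(ys), max(ys)  # assumes y grows downward
            rows.append({
                "text": " ".join((w.text or "").strip() for w in words),
                "rel_height": (bottom - top) / page_h,  # proxy for font size
                "rel_top": top / page_h,                # vertical position
                "center_offset": abs((left + right) / 2 - page_w / 2) / page_w,
                "word_count": len(words),
            })
    return rows


for row in line_features(SAMPLE):
    print(row)
```

Each resulting dict could then be turned into a numeric vector and passed, with a hand-assigned label such as title or author, to whatever supervised classifier is preferred; that training step is left out here because the abstract does not say which learner was used.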
Source: Journal of Modern Information (《现代情报》), CSSCI, 2013, No. 6, pp. 45-48 (4 pages)
Funding: Science and Technology Research and Development Program of Qinhuangdao City, Hebei Province (201101A087)
Keywords: collection of books; digitization; metadata extraction; feature analysis; information extraction