Abstract
Metadata entry is a necessary step when digitizing printed books, but manual entry is labor-intensive and inefficient. To address this problem, this paper proposes a machine-learning-based method for automatically acquiring metadata from scanned books. Descriptive, administrative, and structural metadata elements are defined first. Then, taking the DjVuXML documents of scanned pages as the data source, the format and structural features of each page are analyzed. With text lines as the initial feature vectors, metadata are extracted using rules combined with supervised machine learning. Experiments show that the algorithm achieves high accuracy and recall and can significantly improve the efficiency of book digitization.
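The line-level feature step described in the abstract might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the sample XML, the function name `line_features`, the reading of `coords` as left,bottom,right,top, and the five features chosen are all hypothetical, since the abstract does not list the actual feature set or name a classifier.

```python
import xml.etree.ElementTree as ET

# Hypothetical DjVuXML fragment for one scanned page.
# Assumption: WORD coords are "left,bottom,right,top" in pixels.
SAMPLE = """<DjVuXML><BODY><OBJECT height="1000" width="800"><HIDDENTEXT>
<PAGECOLUMN><REGION><PARAGRAPH>
<LINE><WORD coords="200,150,400,100">Modern</WORD> <WORD coords="420,150,600,100">Libraries</WORD></LINE>
<LINE><WORD coords="300,260,380,240">Zhang</WORD> <WORD coords="390,260,500,240">Wei</WORD></LINE>
<LINE><WORD coords="250,900,330,885">Beijing</WORD> <WORD coords="340,900,420,885">Press</WORD> <WORD coords="430,900,550,885">2013</WORD></LINE>
</PARAGRAPH></REGION></PAGECOLUMN>
</HIDDENTEXT></OBJECT></BODY></DjVuXML>"""


def line_features(xml_text):
    """Return (text, feature_vector) for every LINE element on the page."""
    root = ET.fromstring(xml_text)
    page = root.find(".//OBJECT")
    page_w = float(page.get("width"))
    page_h = float(page.get("height"))

    rows = []
    for line in root.iter("LINE"):
        words = line.findall("WORD")
        if not words:
            continue
        # Keep only the first four numbers in case extra values are present.
        boxes = [[float(v) for v in w.get("coords").split(",")[:4]] for w in words]
        left = min(b[0] for b in boxes)
        bottom = max(b[1] for b in boxes)
        right = max(b[2] for b in boxes)
        top = min(b[3] for b in boxes)
        text = " ".join((w.text or "") for w in words)

        features = [
            (bottom - top) / page_h,                        # relative line height (font-size proxy)
            top / page_h,                                   # vertical position on the page
            abs((left + right) / 2 - page_w / 2) / page_w,  # horizontal distance from page center
            float(len(words)),                              # number of words in the line
            float(any(ch.isdigit() for ch in text)),        # 1.0 if the line contains a digit
        ]
        rows.append((text, features))
    return rows


if __name__ == "__main__":
    for text, features in line_features(SAMPLE):
        print(text, features)
```

Each feature vector, paired with a hand-assigned label such as title, author, or publisher, could then serve as a training example for a supervised classifier.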
Source
《现代情报》
CSSCI
2013, No. 6, pp. 45-48 (4 pages)
Journal of Modern Information
Funding
Science and Technology Research and Development Program of Qinhuangdao, Hebei Province (201101A087)
Keywords
library book collections
digitization
metadata extraction
feature analysis
information extraction