Abstract
As enterprise informatization deepens, a growing share of enterprise text is stored in PDF documents, and the number of such documents keeps rising. To help users find the text they need quickly and to build an enterprise knowledge-sharing platform, this paper designs a PDF document management system based on text information. First, to address the under-utilization of the text in PDF documents, a Stream-based PDF parsing scheme is studied; the scheme is used in the document retrieval module to parse PDF text content. Second, to address the inherent shortcomings of the TF-IDF algorithm, the algorithm is improved in three respects (term frequency, text length, and keyword position) to compute keyword weights; document similarity is then computed with the vector space model, and documents are ranked by weight value. Finally, the system and its functions are verified, showing that content queries in the proposed system achieve higher accuracy and that the system offers an effective and practical solution for enterprise-level intelligent document management platforms.
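The retrieval step described in the abstract can be illustrated with a minimal sketch of the standard, unimproved baseline: plain TF-IDF weights compared by cosine similarity in the vector space model. The abstract does not give the formulas for the paper's term-frequency, text-length, and keyword-position improvements, so none of them are reproduced here. The function names (`tf_idf_vectors`, `cosine`, `rank`) and the sample tokens are assumptions made for illustration only.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build baseline TF-IDF vectors for a list of tokenized documents.

    Returns (vectors, df): one {term: weight} dict per document, plus the
    document-frequency table. This is textbook TF-IDF, not the paper's
    improved weighting.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors, df


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return (doc_index, score) pairs sorted by descending similarity."""
    vectors, df = tf_idf_vectors(docs)
    n_docs = len(docs)
    # Query terms that occur in no document carry no weight and are dropped.
    q_counts = Counter(term for term in query if term in df)
    q_total = sum(q_counts.values()) or 1
    q_vec = {
        term: (count / q_total) * math.log(n_docs / df[term])
        for term, count in q_counts.items()
    }
    scores = [(i, cosine(q_vec, vec)) for i, vec in enumerate(vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical tokens standing in for text extracted from three PDFs.
    corpus = [
        ["pdf", "stream", "parsing"],
        ["pdf", "keyword", "weight"],
        ["pdf", "document", "retrieval", "weight"],
    ]
    for index, score in rank(["keyword", "weight"], corpus):
        print(index, round(score, 3))
```

In this baseline a term that appears in every document (here `pdf`) receives zero weight, and term position is ignored entirely, which is the kind of limitation the keyword-position improvement mentioned in the abstract appears to target.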
Authors
WANG Chunwei (王春伟)
HOU Fang (侯方)
SHEN Sheng (申升)
NAN Sai (南赛)
LI Yingwei (李英伟)
Affiliations: School of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei 066004, China; Beijing Branch, Daqing Oilfield Information Technology Company, Beijing 100043, China
Source
Journal of Yanshan University (《燕山大学学报》)
Indexed in: CAS; Peking University Core Journals (北大核心)
2020, Issue 6, pp. 603-608 (6 pages)
Funding
Supported by the National Natural Science Foundation of China (61827811).
Keywords
text information
file parsing
document retrieval
weight value