
Patent Term Auto-Extraction Based on Multi-Strategy Integration (基于多策略融合的专利术语自动抽取)

Cited by: 4
Abstract: Automatic extraction of patent terms is a key step in knowledge extraction and text mining. Candidate patent terms are first extracted using a stop-word list built for patent literature together with a set of extraction rules. The paper then analyses three factors: the relationship between a patent term and the sentences that contain it, the influence of adjacent patent terms on one another, and the interference that common-sense (general) words cause in patent term extraction. For these it proposes, respectively, STRank, a weight calculation method based on the idea of PageRank; a method for computing patent term distinction; and a down-weighting method that uses HowNet sememe information. The three methods are fused to extract patent terms. In experiments on patent documents from the sensor domain, precision at the top-1400 and top-1600 levels is 80.5% and 79.7%, improvements of 11.4% and 9.5% respectively over the CS+CC+CD method. The experimental results demonstrate the effectiveness of the multi-strategy fusion method.
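Source: Computer Applications and Software (《计算机应用与软件》), CSCD, 2015, No. 2, pp. 28-32 (5 pages)
Funding: National Natural Science Foundation of China (61171159, 61271304); Key Project of the Beijing Municipal Education Commission Science and Technology Development Plan and Class B Key Project of the Beijing Natural Science Foundation (KZ201311232037)
Keywords: patent term; term extraction; PageRank; term distinction; sememe information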
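
The abstract describes STRank only as a PageRank-inspired weighting that links each patent term to the sentences containing it; the actual formula is not given here. The following is a minimal sketch of that general idea, not the authors' method: scores are passed back and forth between candidate terms and sentences on a bipartite graph, with a PageRank-style damping factor. The function name `rank_terms`, the even splitting of scores, the damping value 0.85, and the sample sentences are all assumptions made for illustration.

```python
from collections import defaultdict


def rank_terms(sentences, damping=0.85, iterations=50):
    """Score candidate terms on a term-sentence bipartite graph.

    `sentences` is a list of lists of candidate terms (hypothetical input
    format). This is a generic PageRank-style iteration, NOT the STRank
    formula from the paper, which the abstract does not specify.
    """
    # Distinct terms per sentence, and the sentences each term occurs in.
    sent_terms = [sorted(set(s)) for s in sentences]
    term_sents = defaultdict(list)
    for i, terms_in_sentence in enumerate(sent_terms):
        for term in terms_in_sentence:
            term_sents[term].append(i)

    terms = sorted(term_sents)
    if not terms:
        return {}

    # Start from a uniform distribution over terms.
    score = {t: 1.0 / len(terms) for t in terms}

    for _ in range(iterations):
        # Each term splits its score evenly among its sentences.
        sent_score = [0.0] * len(sent_terms)
        for term in terms:
            share = score[term] / len(term_sents[term])
            for i in term_sents[term]:
                sent_score[i] += share

        # Each sentence splits its score evenly among its terms,
        # and the damping factor mixes in a uniform baseline.
        new_score = {}
        for term in terms:
            incoming = sum(sent_score[i] / len(sent_terms[i])
                           for i in term_sents[term])
            new_score[term] = (1.0 - damping) / len(terms) + damping * incoming
        score = new_score

    return score


if __name__ == "__main__":
    # Made-up candidate terms, for illustration only.
    sample = [
        ["pressure sensor", "signal"],
        ["pressure sensor", "electrode"],
        ["pressure sensor"],
    ]
    for term, value in sorted(rank_terms(sample).items(),
                              key=lambda kv: kv[1], reverse=True):
        print(f"{term}\t{value:.4f}")
```

Because every term and every sentence redistributes its full score, the term scores keep summing to 1 across iterations. A term that appears in many sentences, or in sentences with few competing terms, ends up with a higher weight. In a fused pipeline such a weight would presumably be combined with the term distinction and sememe-based down-weighting scores before ranking, but how the paper combines them is not stated in the abstract.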