Abstract
Automatic extraction of patent terms is a key step in knowledge extraction and text mining. Candidate patent terms are first extracted using a stop-word list constructed for patent literature together with a set of specific extraction rules. By analysing the relationship between a patent term and the sentences in which it occurs, the influence of adjacent patent terms on one another, and the interference of common-sense words with patent term extraction, we propose, respectively, STRank, a weight calculation method based on the idea of PageRank; a method for computing the distinctiveness of patent terms; and a weight-reduction method based on HowNet sememe information. These methods are then combined to extract patent terms. In experiments on patent documents from the sensor field, precision at the top-1400 and top-1600 levels is 80.5% and 79.7%, an improvement of 11.4% and 9.5%, respectively, over the CS+CC+CD method. The experimental results demonstrate the effectiveness of this multi-strategy fusion method.
Source
《计算机应用与软件》
CSCD
2015, Issue 2, pp. 28-32 (5 pages)
Computer Applications and Software
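Funding
National Natural Science Foundation of China (61171159, 61271304)
Key Project of the Science and Technology Development Plan of the Beijing Municipal Education Commission and Category B Key Project of the Beijing Natural Science Foundation (KZ201311232037)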