Abstract
Motivated by the practical needs of information analysis, this paper studies 5,405 patent records on electric vehicles through term extraction, rare-term recognition, and field comparison. The results show that key-phrase extraction is feasible, and that documents containing terms extracted by mutual information have an average length closer to the collection's average document length. Terms from the abstract and First Claim fields differ to some extent but are equally important for classification or clustering. The rare-term recognition algorithm can identify correspondences between rare words and high-frequency words. The findings provide results and methods for patent text mining and patent information analysis, and supply reference terms for information analysis work.
Source
《图书情报工作》
CSSCI
Peking University Core Journal (北大核心)
2013, No. 1, pp. 130-135 (6 pages)
Library and Information Service
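Funding
China Postdoctoral Science Foundation, 51st batch, General Program first-class grant, "Term Extraction and Term-Based Classification and Clustering in Scientific and Technical Text Information Resources" (Grant No. 2012M510040)
A research output of the Institute of Scientific and Technical Information of China discipline construction project "Natural Language Processing" (Project No. XK2012-6)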
Keywords
term extraction
text mining
patent
information analysis