Abstract
As a formal tool for data analysis and knowledge processing, Formal Concept Analysis (FCA) can effectively mine knowledge of interest from massive textual data, and it has attracted considerable attention from researchers. FCA presupposes a clean, well-defined formal context. When feature words are extracted directly from text, the textual formal context (TFC) built from the document-to-feature-word relation is a highly sparse two-dimensional table containing a great deal of noise, which seriously degrades both the efficiency of concept lattice construction and the structure of the resulting lattice. An effective reduction method for textual formal contexts is therefore needed. Taking the essential characteristics of textual formal contexts into account, this paper proposes a reduction method, TFC-Reducing, based on the semantic distance between attributes and on mathematical principles, and gives two measures for evaluating a reduction: information loss entropy (ILE) and semantic coverage (SC).
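To make the notion of a sparse textual formal context concrete, the following minimal sketch builds a toy document-to-feature-word context and enumerates its formal concepts by brute force. The corpus, the function names, and the sparsity measure are illustrative assumptions rather than anything taken from the paper, and the sketch does not implement TFC-Reducing, ILE, or SC.

```python
from itertools import combinations

# Hypothetical toy corpus: document id -> set of extracted feature words.
docs = {
    "d1": {"lattice", "concept", "context"},
    "d2": {"lattice", "concept"},
    "d3": {"entropy", "context"},
    "d4": {"thesaurus"},
}


def build_context(docs):
    """Return (objects, attributes, incidence) for a document-word table."""
    objects = sorted(docs)
    attributes = sorted(set().union(*docs.values()))
    incidence = {(g, m) for g in objects for m in docs[g]}
    return objects, attributes, incidence


def sparsity(objects, attributes, incidence):
    """Fraction of empty cells in the binary table (an assumed measure)."""
    return 1 - len(incidence) / (len(objects) * len(attributes))


def extent(attrs, objects, incidence):
    """All objects that have every attribute in attrs."""
    return frozenset(
        g for g in objects if all((g, m) in incidence for m in attrs)
    )


def intent(objs, attributes, incidence):
    """All attributes shared by every object in objs."""
    return frozenset(
        m for m in attributes if all((g, m) in incidence for g in objs)
    )


def concepts(objects, attributes, incidence):
    """Enumerate formal concepts by closing every attribute subset.

    Exponential in the number of attributes, so only usable on toy data.
    """
    found = set()
    for r in range(len(attributes) + 1):
        for attrs in combinations(attributes, r):
            ext = extent(attrs, objects, incidence)
            found.add((ext, intent(ext, attributes, incidence)))
    return found


objects, attributes, incidence = build_context(docs)
all_concepts = concepts(objects, attributes, incidence)
print(f"sparsity = {sparsity(objects, attributes, incidence):.2f}")
print(f"number of concepts = {len(all_concepts)}")
```

Even in this tiny table most cells are empty, and every extra feature word can add concepts to the lattice, which is the cost that a context reduction aims to limit.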
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
CSCD
PKU Core Journal
2012, No. 10, pp. 2170-2176 (7 pages)
Funding
Supported by the National Natural Science Foundation of China (Grant No. 70871115)
Keywords
textual formal context
semantic distance
attribute reduction
domain thesaurus