
Improved Text Clustering Using Context
Cited by: 9
Abstract: In the traditional vector space model for text clustering, the weight of a word depends only on its frequency and not on the context in which the word appears. This paper describes how to analyze the context of a document with an ontology dictionary organized by the relations between words, which yields the semantic relations among the words in the document. The frequencies of the related words and the weights of those relations are then used to quantify how strongly the context supports a word, so that a word's weight reflects both its own frequency and its contextual support. The paper also describes how to build the required dictionary automatically, so that the method can be applied to Chinese, where ontology dictionary resources are still scarce. Experimental results show that the method effectively reduces noise and improves text clustering over the traditional method.
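The abstract does not state the weighting formula, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a word's weight is its term frequency plus a scaled sum, over related words in the same document, of relation weight times the related word's frequency. The function name `context_weights`, the mixing parameter `alpha`, and the toy relation table are all hypothetical.

```python
from collections import Counter


def context_weights(tokens, relations, alpha=0.5):
    """Weight each word by its frequency plus support from related words.

    `relations` maps an unordered word pair (a frozenset of two words)
    to a relation weight; this stands in for the ontology dictionary.
    `alpha` (hypothetical) scales how much contextual support counts.
    """
    tf = Counter(tokens)
    weights = {}
    for word, freq in tf.items():
        support = 0.0
        for other, other_freq in tf.items():
            if other == word:
                continue
            # Relation weight is 0.0 when the dictionary has no entry.
            rel = relations.get(frozenset((word, other)), 0.0)
            support += rel * other_freq
        weights[word] = freq + alpha * support
    return weights


# Toy document and toy relation table (assumed values, not from the paper).
tokens = ["bank", "loan", "loan", "river", "interest", "bank"]
relations = {
    frozenset(("bank", "loan")): 0.8,
    frozenset(("bank", "interest")): 0.6,
    frozenset(("loan", "interest")): 0.7,
}
weights = context_weights(tokens, relations)
```

In this toy input, "river" has no related words in the document, so its weight stays at its raw frequency, while "interest" has the same raw frequency but ends up with a higher weight because "bank" and "loan" support it. This is the sense in which contextual support can suppress isolated, noisy terms.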
Source: Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list, 2007, No. 6, pp. 109-115 (7 pages)
Keywords: computer application; Chinese information processing; text clustering; context; word weight; ontology dictionary


