
A Novel Text Clustering Algorithm Based on Inner Product Space Model of Semantic

Cited by: 44
Abstract: Existing clustering methods perform poorly on text data, and on short texts in particular, because they ignore the latent similarity between words. To address the high dimensionality and sparsity of text data, this paper proposes a text clustering algorithm based on a semantic inner product space model. The algorithm first uses the definition of an inner product space to build similarity measures for Chinese concepts, words, and texts, and then analyzes these measures theoretically. Clustering is then completed in a two-phase process: a top-down "divide" phase followed by a bottom-up "merge" phase. The method has been applied successfully to clustering Chinese short texts. Experiments show that it achieves better clustering quality than traditional methods.
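The abstract does not give the paper's formulas, so the following is only a minimal sketch of the general idea of measuring text similarity through an inner product that accounts for word-to-word similarity. It assumes a bag-of-words representation and the generic form <u, v> = sum over i, j of u_i * S_ij * v_j, where S is a symmetric, positive definite word-similarity matrix; this form, the function names, and the toy similarity values are all illustrative assumptions and not the authors' definitions.

```python
import math


def semantic_inner_product(u, v, word_sim):
    """Assumed form: <u, v> = sum_i sum_j u[i] * S[i][j] * v[j].

    u and v map words to weights; word_sim(a, b) supplies S.
    S must be symmetric and positive definite for this to be a
    true inner product (an assumption, not taken from the paper).
    """
    return sum(
        wu * wv * word_sim(a, b)
        for a, wu in u.items()
        for b, wv in v.items()
    )


def semantic_similarity(u, v, word_sim):
    """Cosine-style similarity induced by the inner product above."""
    denom = math.sqrt(
        semantic_inner_product(u, u, word_sim)
        * semantic_inner_product(v, v, word_sim)
    )
    if denom == 0:
        return 0.0
    return semantic_inner_product(u, v, word_sim) / denom


# Hypothetical word-similarity table, for illustration only.
TOY_SIM = {frozenset(("car", "automobile")): 0.9}


def toy_word_sim(a, b):
    """Return 1.0 for identical words, a table value or 0.0 otherwise."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)


def exact_match_sim(a, b):
    """Baseline with no semantic information: words match or they do not."""
    return 1.0 if a == b else 0.0


# Two short texts that share one word and contain one pair of synonyms.
doc_a = {"car": 1.0, "price": 1.0}
doc_b = {"automobile": 1.0, "price": 1.0}

plain = semantic_similarity(doc_a, doc_b, exact_match_sim)
semantic = semantic_similarity(doc_a, doc_b, toy_word_sim)
print(f"exact-match similarity: {plain:.2f}")
print(f"semantic similarity:    {semantic:.2f}")
```

With exact matching, the two short texts overlap on one word only; once the synonym pair is given a nonzero similarity, the score rises. This illustrates why ignoring word similarity hurts short-text clustering, where texts share few literal terms.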
Source: Chinese Journal of Computers (《计算机学报》; indexed in EI, CSCD, and the Peking University Core Journals list), 2007, No. 8, pp. 1354-1363 (10 pages)
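Funding: Supported by the National Natural Science Foundation of China (6473051, 60503037), the China Postdoctoral Science Foundation (20060400002), the Sichuan Youth Science and Technology Foundation (2007Q14-055), the National High Technology Research and Development Program of China ("863" Program) (2006AA01Z230), and the Beijing Natural Science Foundation (4062018).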
Keywords: inner product space; text clustering; concept similarity; similarity computing; data mining

References (14)

  • 1. Pelleg D, Moore A. X-means: Extending K-means with efficient estimation of the number of clusters//Proceedings of the 17th International Conference on Machine Learning (ICML). Palo Alto, 2000: 727-734
  • 2. Hamerly G, Elkan C. Learning the k in k-means//Proceedings of the 17th Annual Conference on Neural Information Processing Systems (NIPS). 2003: 281-289
  • 3. Han Jia-Wei, Kamber M. Data Mining: Concepts and Techniques (2nd Edition). San Francisco: Morgan Kaufmann Publishers, 2006
  • 4. Corley C, Mihalcea R. Measuring the semantic similarity of texts//Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment. Ann Arbor, 2005: 13-18
  • 5. Possas B, Ziviani N, Meira W, Ribeiro-Neto B. Set-based vector model: An efficient approach for correlation-based ranking. ACM Transactions on Information Systems, 2005, 23(4): 397-429
  • 6. Zhang Z, Otterbacher J, Radev D. Learning cross-document structural relationships using boosting//Proceedings of the 12th International Conference on Information and Knowledge Management. New Orleans, 2003: 124-130
  • 7. Hammouda K M, Kamel M S. Efficient phrase-based document indexing for Web document clustering. IEEE Transactions on Knowledge and Data Engineering, 2004, 16(10): 1279-1296
  • 8. Dolan W B, Quirk C, Brockett C. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources//Proceedings of the 20th International Conference on Computational Linguistics. Geneva, Switzerland, 2004: 350-356
  • 9. Beil F, Ester M, Xu Xiao-Wei. Frequent term-based text clustering//Proceedings of the 8th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. Edmonton, Alberta, Canada, 2002: 436-442
  • 10. Liu Qun, Li Su-Jian. Word similarity computing based on How-Net. Computational Linguistics and Chinese Language Processing, 2002, 7(2): 59-76


