
Text Clustering Method Study Based on MapReduce
(Original title: 一种基于MapReduce的文本聚类方法研究; cited by: 6)
Abstract: Text clustering is a key technology for text organization, information extraction, and topic retrieval. The choice of similarity measure strongly affects clustering quality. Common measures, such as Euclidean distance and the correlation coefficient, capture only low-order (for example, linear) correlations between documents, whereas the relationships among text documents are highly complex, so clustering based on these measures often gives unsatisfactory results. Clustering methods based on more complex measures have been proposed, but their computational cost grows markedly with dataset size, and traditional clustering methods are unsuitable for large-scale text clustering. To address these problems, this paper proposes a distributed clustering method based on MapReduce. The method improves on the traditional K-means algorithm by using information loss as the similarity measure. To further improve clustering efficiency, it is combined with a MapReduce-based parallel principal component analysis (PCA) method that reduces the dimensionality of the document feature vectors. Experimental results show that the proposed large-scale text clustering method achieves better clustering performance than existing clustering methods.
Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list; 2016, No. 1, pp. 246-250, 269 (6 pages)
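Funding: Supported by the National Natural Science Foundation of China (61472230) and the Shandong Province Science and Technology Development Plan (2013GZC20102)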
Keywords: text clustering; MapReduce; K-means; information loss
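
The abstract describes a K-means variant that assigns documents by information loss and runs as MapReduce jobs, but this record does not give the formula or the job layout. The following is a minimal single-machine sketch under stated assumptions: documents are term-probability vectors, the information loss is taken to be the weighted Jensen-Shannon merge cost known from the information-bottleneck literature, and `map_assign`, `reduce_centroid`, and `kmeans_info_loss` are hypothetical names standing in for a mapper, a reducer, and a driver loop. The PCA dimensionality-reduction step is omitted. It may differ from the authors' actual method.

```python
import math
from collections import defaultdict


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; zero entries of p contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def information_loss(p, w_p, q, w_q):
    """Assumed measure: weighted Jensen-Shannon cost of merging p and q (weights > 0)."""
    total = w_p + w_q
    a, b = w_p / total, w_q / total
    merged = [a * pi + b * qi for pi, qi in zip(p, q)]
    return total * (a * kl(p, merged) + b * kl(q, merged))


def map_assign(doc, doc_weight, centroids, weights):
    """Map step: emit (index of the centroid with the smallest information loss, doc)."""
    losses = [information_loss(doc, doc_weight, c, w)
              for c, w in zip(centroids, weights)]
    return losses.index(min(losses)), doc


def reduce_centroid(members):
    """Reduce step: the new centroid is the mean of the member distributions."""
    return [sum(column) / len(members) for column in zip(*members)]


def kmeans_info_loss(docs, centroids, max_iter=20):
    """Driver loop: repeat map and reduce until the assignments stop changing."""
    centroids = [list(c) for c in centroids]  # do not mutate the caller's lists
    doc_weight = 1.0 / len(docs)
    weights = [1.0 / len(centroids)] * len(centroids)
    labels = None
    for _ in range(max_iter):
        groups = defaultdict(list)  # plays the role of the shuffle phase
        new_labels = []
        for doc in docs:
            key, value = map_assign(doc, doc_weight, centroids, weights)
            groups[key].append(value)
            new_labels.append(key)
        if new_labels == labels:
            break
        labels = new_labels
        # Only non-empty clusters are updated; empty ones keep their old centroid.
        for k, members in groups.items():
            centroids[k] = reduce_centroid(members)
            weights[k] = len(members) * doc_weight
    return labels, centroids
```

In a real deployment the inner loop over `docs` would be distributed across mappers and the per-cluster averaging across reducers, with the current centroids shipped to every mapper at the start of each iteration.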


