
Highly efficient KNN Chinese text classification algorithm based on the Spark framework (cited by: 19)
Abstract The time complexity of the K-Nearest Neighbor (KNN) classification algorithm is proportional to the number of training samples, which makes it computationally expensive, and traditional architectures process data slowly in the current big data setting. To address both problems, a highly efficient KNN classification algorithm based on the Spark framework and clustering optimization is proposed. First, the training set is pruned twice using a K-medoids clustering algorithm optimized by introducing a constriction factor. Then, during classification, the value of K is iterated to obtain the classification result, and the data is partitioned and processed iteratively on the Spark computing framework to parallelize the computation. Experimental results show that, across different datasets, the traditional KNN algorithm and the K-medoids-based KNN algorithm take 3.92 to 31.90 times as long as the proposed Spark-based KNN algorithm. The proposed algorithm has high computational efficiency and a better speedup ratio than KNN on the Hadoop platform, and it can classify big data effectively.
Source Journal of Computer Applications (《计算机应用》), CSCD, Peking University Core Journal, 2016, Issue 12, pp. 3292-3297 (6 pages)
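Funding National Natural Science Foundation of China (61402258); Shandong Province Undergraduate University Teaching Reform Research Project (2015M102); University-level Teaching Reform Research Project (jg05021*)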
Keywords K-Nearest Neighbor (KNN); clustering; constriction factor; K-medoids; Spark; parallel computing
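
The abstract says the data is partitioned so that the KNN computation can run in parallel. The sketch below illustrates that general idea only and is not the authors' implementation: it uses plain Python lists in place of Spark partitions, cosine distance as an assumed similarity measure, and a fixed K instead of the iterated K described in the abstract. The function names (`cosine_distance`, `local_top_k`, `knn_classify`) and the toy data are hypothetical. Each partition reports its own K nearest candidates, and merging those candidates is enough to recover the global K nearest neighbors.

```python
import heapq
import math
from collections import Counter


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def local_top_k(partition, query, k):
    """Map step: the k nearest (distance, label) pairs inside one partition."""
    scored = ((cosine_distance(vec, query), label) for vec, label in partition)
    return heapq.nsmallest(k, scored)


def knn_classify(partitions, query, k):
    """Reduce step: merge per-partition candidates, then take a majority vote."""
    candidates = [pair for part in partitions
                  for pair in local_top_k(part, query, k)]
    nearest = heapq.nsmallest(k, candidates)
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy training set, already split into three partitions.
partitions = [
    [([1.0, 0.0, 0.0], "sports"), ([0.9, 0.1, 0.0], "sports")],
    [([0.0, 1.0, 0.0], "finance"), ([0.1, 0.9, 0.1], "finance")],
    [([0.8, 0.2, 0.1], "sports"), ([0.0, 0.8, 0.3], "finance")],
]

print(knn_classify(partitions, [0.95, 0.05, 0.0], k=3))
```

Because every global nearest neighbor is also among the K nearest within its own partition, each partition only needs to return K candidates, which keeps the merge step small regardless of the training set size.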
