摘要
针对K-最近邻(KNN)分类算法时间复杂度与训练样本数量成正比而导致的计算量大的问题以及当前大数据背景下面临的传统架构处理速度慢的问题,提出了一种基于Spark框架与聚类优化的高效KNN分类算法。该算法首先利用引入收缩因子的优化K-medoids聚类算法对训练集进行两次裁剪;然后在分类过程中迭代K值获得分类结果,并在计算过程中结合Spark计算框架对数据进行分区迭代实现并行化。实验结果表明,在不同数据集中传统尽最近邻算法、基于K-medoids的群最近邻算法所耗费时间是所提Spark框架下的B最近邻算法的3.92-31.90倍,所提算法具有较高的计算效率,相较于Hadoop平台有较好的加速比,可有效地对大数据进行分类处理。
The time complexity of K-Nearest Neighbor(KNN) classification algorithm is proportional to the number of training samples, which needs a large number of computation, and the bottleneck of slow processing exists in traditional architecture under the big data background. In order to solve the problems, a highly efficient algorithm of KNN based on Spark framework and clustering was proposed. Firstly, the training set was cut twice by the optimized K-medoids algorithm through introducing constriction factor. Then the K was iterated constantly in the process of classification and the classification result was obtained. And the data was partitioned and iterated to realize parallelization combining the Spark framework in the calculation. The experimental results show that, the classification time of the traditional KNN algorithm and the KNN algorithm based on K-medoids is 3.92 -31,90 times of the proposed algorithm in different datasets. The proposed algorithm has high computational efficiency and better speedup ratio than KNN based on Hadoop platform, and it can effectively classify the big data.
出处
《计算机应用》
CSCD
北大核心
2016年第12期3292-3297,共6页
journal of Computer Applications
基金
国家自然科学基金资助项目(61402258)
山东省本科高校教学改革研究项目(2015M102)
校级教学改革研究项目(jg05021*)~~