Abstract
The K-nearest neighbor (KNN) algorithm is a simple, effective, and easy-to-implement classification algorithm that is suitable for classification over large class domains. Recent research on KNN has focused on static big data sets, yet a growing number of scenarios require KNN to process streaming data online in real time. Streaming data is high-volume, continuous, and fast, and it is hard to store and recover, while the stream processing system Storm offers real-time and reliable processing of such data. On this basis, a Storm-based KNN classification algorithm for streaming data is proposed. The algorithm first partitions the whole sample set into multiple piece sets, then computes the K nearest neighbors of each to-be-classified vector on every piece set, and finally reduces the per-piece K nearest neighbors to the overall K nearest neighbors, which determines the class of the vector. Experimental results show that the proposed algorithm meets the requirements of high throughput, scalability, real-time performance, and accuracy for classifying streaming data in a big data setting.
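The partition, local-KNN, and reduce steps described in the abstract can be sketched in plain Python as follows. This is only a minimal single-process illustration, not the authors' Storm topology: the function names `partition`, `local_knn`, and `classify`, the round-robin partitioning, the Euclidean distance, and the majority vote are all assumptions made for the example. In a Storm deployment each piece set would presumably be held by a separate worker, which this sketch does not model.

```python
import heapq
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def partition(samples, n_pieces):
    """Split the labelled sample set into n_pieces piece sets (round-robin, an assumption)."""
    return [samples[i::n_pieces] for i in range(n_pieces)]


def local_knn(piece, query, k):
    """Return up to k nearest (distance, label) pairs of `query` within one piece set."""
    return heapq.nsmallest(k, ((dist(vec, query), label) for vec, label in piece))


def classify(pieces, query, k):
    """Reduce the per-piece neighbours to the overall k nearest, then majority-vote."""
    candidates = [pair for piece in pieces for pair in local_knn(piece, query, k)]
    global_knn = heapq.nsmallest(k, candidates)
    votes = Counter(label for _, label in global_knn)
    return votes.most_common(1)[0][0]


# Hypothetical toy sample set: two well-separated classes "a" and "b".
samples = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
           ((5.0, 5.0), "b"), ((5.1, 4.9), "b"), ((4.8, 5.2), "b")]
pieces = partition(samples, 3)
print(classify(pieces, (0.3, 0.3), k=3))
```

Because every piece set returns its own k nearest candidates, the overall k nearest neighbors are always among the merged candidates, so the reduced result agrees with a brute-force KNN over the unpartitioned sample set (up to tie-breaking between equal distances).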
Authors
周志阳
冯百明
杨朋霖
温向慧
ZHOU Zhiyang; FENG Baiming; YANG Penglin; WEN Xianghui (College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China)
Source
《计算机工程与应用》
CSCD
Peking University Core Journals (北大核心)
2017, Issue 19, pp. 71-75, 97 (6 pages in total)
Computer Engineering and Applications
Funding
National Natural Science Foundation of China (No. 61462076, No. 61662067)