摘要
针对已有的基于网格的离群点挖掘算法挖掘效率低和对于大数据集适应性差的问题,提出基于数据分区和网格的离群点挖掘算法。算法首先将数据进行分区,以单元为单位筛选非离群点,并把中间结果暂存起来;然后采用改进的维单元树结构维护数据点的空间信息,以微单元为单位进行非离群点筛选,并通过两个优化策略进行高效操作;最后以数据点为单位挖掘离群点,从而得到离群数据集合。理论分析和实验结果表明了该方法是有效可行的,对大数据集和高维数据具有更好的伸缩性。
To solve the problems of inefficiency and bad-adaptability for the existing outlier mining algorithms based on grid, this paper proposed an outlier mining algorithm based on data partitioning and grid. Firstly, the technology of data partitioning was applied. Secondly, the non-outliers were filtered out by cell and the intermediate results were temporarily stored. Thirdly, the structure of the improved Cell Dimension Tree (CD-Tree) was created to maintain the spatial information of the reserved data. Afterwards, the non-outliers were filtered out by micro-cell and were operated efficiently through two optimization strategies. Finally, followed by mining by data point, the outlier set was obtained. The theoretical analysis and experimental results show that the method is feasible and effective, and has better scalability for dealing with massive and high dimensional data.
出处
《计算机应用》
CSCD
北大核心
2012年第8期2193-2197,共5页
journal of Computer Applications
关键词
数据挖掘
离群数据
网格
数据分区
单元
微单元
维单元树
data mining
outlier data
grid
data partitioning
cell
micro-cell
Cell Dimension Tree (CD-Tree)