Abstract
Outlier mining in d-dimensional point sets is currently one of the active research topics in data mining. Existing approaches based on distance or nearest-neighbor concepts perform poorly on high-dimensional data. To address this, this paper applies the angle-based outlier factor to high-dimensional outlier mining and proposes a novel scheme based on a random projection algorithm that can estimate the angle-based outlier factor of all data points in time near-linear in the size of the data. The approach can also be run in a parallel environment to obtain a parallel speedup. A theoretical analysis of the approximation quality is given to guarantee the reliability of the algorithm. Experiments on synthetic and real-world data sets show that the approach is efficient and scalable on very large high-dimensional data sets.
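To make the idea of an angle-based score concrete, the following is a minimal brute-force sketch and not the paper's algorithm. It assumes a simplified, unweighted variant of the factor (the plain variance of the angles that a point forms with all pairs of other points), assumes all points are distinct, and runs in cubic time; it is this kind of exact computation that the random-projection scheme in the abstract is meant to approximate in near-linear time. The function name `variance_of_angles` is hypothetical.

```python
import numpy as np


def variance_of_angles(points):
    """Brute-force angle-variance score for every point (O(n^3) time).

    Assumption: unweighted variant of an angle-based outlier factor.
    For each point p, take the angle between (a - p) and (b - p) over all
    pairs of other points a, b, and return the variance of those angles.
    A low variance suggests that p lies outside the bulk of the data.
    Assumes at least three points, all distinct.
    """
    points = np.asarray(points, dtype=float)
    n = len(points)
    scores = np.empty(n)
    # Index pairs (a, b) with a < b among the n - 1 other points.
    upper = np.triu_indices(n - 1, k=1)
    for i in range(n):
        # Difference vectors from point i to every other point.
        diffs = np.delete(points, i, axis=0) - points[i]
        units = diffs / np.linalg.norm(diffs, axis=1)[:, None]
        # Cosines of all pairwise angles; clip guards against rounding.
        cosines = np.clip(units @ units.T, -1.0, 1.0)
        scores[i] = np.arccos(cosines[upper]).var()
    return scores
```

Ranking points by ascending score puts the most outlier-like points first under this assumed variant, since a point far from the data sees all other points within a narrow cone of directions.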
Source
《计算机工程与应用》
CSCD
2013, Issue 24, pp. 122-129 (8 pages)
Computer Engineering and Applications
Funding
2011 Scientific Research Project of the Hunan Provincial Department of Education (No. 11C0784)
Keywords
outlier data mining
angle
random projection algorithm
near-linear time
reliability
efficiency