Abstract
Outliers in time series data directly affect the results of data mining algorithms and can even cause them to fail. The traditional Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm can be used to identify outliers, but it is sensitive to its parameters, has high time complexity, and offers limited accuracy. Considering the characteristics of time series data, this paper proposes an outlier identification algorithm based on variance clustering that can automatically perform multiple identification passes. The method converts the traditional neighborhood density into variance and mean, and the density threshold into the variance and a threshold within a time window. Based on definitions of outlier data, outlier cluster data, and abnormal cluster data, it gives the decision rules for outlier identification. Because a single identification pass may not eliminate all outliers, the algorithm is extended to a multi-pass identification algorithm by defining a termination condition for repeated identification. Application in an aerospace data mining project verified that the algorithm has good generality and low time complexity, and that multiple passes improve its accuracy.
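The multi-pass idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the paper's actual decision rules for outlier data, outlier cluster data, or abnormal cluster data, so this sketch substitutes a simple rule as an assumption: a point is flagged when it deviates from the mean of its time-window neighbors by more than `k` standard deviations. The function names and the parameters `window`, `k`, `min_std`, and `max_passes` are hypothetical and do not come from the paper.

```python
import statistics


def identify_outliers_once(values, active, window=5, k=3.0, min_std=1e-9):
    """One identification pass (assumed rule, not the paper's).

    Flags an active point when its distance from the mean of its active
    neighbors inside the time window exceeds k times their standard deviation.
    """
    flagged = set()
    n = len(values)
    for i in range(n):
        if not active[i]:
            continue
        lo, hi = max(0, i - window), min(n, i + window + 1)
        # Neighbors: active points in the window, excluding the point itself.
        neighbours = [values[j] for j in range(lo, hi) if j != i and active[j]]
        if len(neighbours) < 2:
            continue  # not enough data to estimate a variance
        mean = statistics.fmean(neighbours)
        std = max(statistics.pstdev(neighbours), min_std)
        if abs(values[i] - mean) > k * std:
            flagged.add(i)
    return flagged


def identify_outliers(values, window=5, k=3.0, min_std=1e-9, max_passes=10):
    """Repeat the single pass until no new outliers appear (assumed
    termination condition) or max_passes is reached."""
    active = [True] * len(values)
    outliers = set()
    for _ in range(max_passes):
        new = identify_outliers_once(values, active, window, k, min_std)
        if not new:
            break
        for i in new:
            active[i] = False  # exclude flagged points from later passes
        outliers |= new
    return sorted(outliers)
```

With a series such as `[1.0, 1.2, 1.0, 1.2, 1000.0, 1.0, 20.0, 1.2, 1.0, 1.2, 1.0]`, a single pass flags only the value 1000.0, because it inflates the window variance and masks the value 20.0. Once the first outlier is excluded, a second pass flags 20.0 as well, which is the motivation the abstract gives for repeated identification.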
Source
Journal of Computer Applications (《计算机应用》)
CSCD
Peking University Core Journals (北大核心)
2012, Issue A02, pp. 22-25 (4 pages)
Keywords
time series data
outlier identification
clustering data mining
DBSCAN algorithm