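Abstract
Outlier detection is the task of identifying anomalous data whose feature attributes differ significantly from those of normal data. Most clustering-based outlier detection methods detect outliers mainly from a global perspective and perform poorly on local outliers. To address this, this paper improves the K-means clustering algorithm by introducing the clustering by fast search and find of density peaks method, and proposes a local outlier detection method named KLOD (local outlier detection based on improved K-means and least-squares methods) for the accurate detection of local outliers. First, the density-peaks method is used to compute the local density and relative distance of each data point, and the two are multiplied to obtain the γ value. Second, the γ values are sorted in descending order, and the elbow method is used to select the k data points with the largest γ values as the initial cluster centers for the K-means clustering algorithm. Then, K-means partitions the dataset into k clusters, and the objective function value of each data point in each dimension is computed and sorted in ascending order. Next, the degree of dispersion of each dimension is determined, a suitable fitting function and fitting points are chosen, and the least-squares method is used to fit the ascending objective function values of each dimension in each cluster; the fitted function is then differentiated to obtain the rate of change. Finally, incorporating information entropy, each data point's objective function value in each dimension is weighted by the corresponding rate of change to obtain the final anomaly score, and the top-n data points with the highest anomaly scores are regarded as outliers. Simulation experiments on artificial datasets and UCI datasets compare the accuracy of the KLOD, LOF, and KNN methods. The results show that KLOD achieves higher accuracy than KNN and LOF. The proposed KLOD method effectively improves the clustering performance of the K-means algorithm and offers good accuracy and performance in local outlier detection.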
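The γ-based choice of initial cluster centers described above can be illustrated with a short sketch. It is not the authors' implementation: the function names `gamma_scores` and `initial_centers` are hypothetical, a Gaussian kernel is assumed for the local density (the abstract does not say which kernel is used), and both the cutoff distance `dc` and the number of clusters `k` are taken as inputs rather than chosen by the elbow method.

```python
import numpy as np

def gamma_scores(X, dc):
    """Return (rho, delta, gamma) for each row of X.

    rho   : local density (assumed Gaussian kernel with cutoff distance dc)
    delta : distance to the nearest point of higher density
    gamma : rho * delta, large for likely cluster centers
    """
    # Pairwise Euclidean distances between all points.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Subtract 1 to remove each point's contribution to its own density.
    rho = np.exp(-(dist / dc) ** 2).sum(axis=1) - 1.0
    n = X.shape[0]
    delta = np.empty(n)
    for i in range(n):
        denser = rho > rho[i]
        # The densest point has no denser neighbor; use its largest distance.
        delta[i] = dist[i, denser].min() if denser.any() else dist[i].max()
    return rho, delta, rho * delta

def initial_centers(X, dc, k):
    """Pick the k points with the largest gamma as K-means starting centers."""
    _, _, gamma = gamma_scores(X, dc)
    top = np.argsort(gamma)[::-1][:k]   # indices sorted by descending gamma
    return X[top]
```

The returned array could be passed as the starting centers of any K-means implementation that accepts explicit initial centers.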
Objective: Outliers are data points generated by various special causes. Although they are often regarded as noise because they deviate from normal data points, they make up only a small proportion of the dataset and are of research value. Outlier detection identifies these points and examines the abnormal information they may carry by analyzing the attribute features of the data, with the aim of uncovering unusual patterns or behaviors in the dataset. Most clustering-based outlier detection methods detect outliers primarily from a global perspective and perform poorly on local outliers. Hence, the K-means clustering algorithm is improved by introducing clustering by fast search and find of density peaks, and a local outlier detection method named KLOD (local outlier detection based on improved K-means and least-squares methods) is proposed to achieve precise detection of local outliers.
Methods: K-means is a hard clustering algorithm: after clustering, each data point belongs to exactly one cluster. This property makes it suitable for outlier detection, as outliers significantly affect the clustering process. However, the choice of initial cluster centers and of the number of clusters is crucial, because both directly affect clustering quality. To select accurate cluster centers, clustering by fast search and find of density peaks is used to compute the local density and relative distance of each data point, and a decision graph is constructed from these two metrics. The difficulty is that the cutoff distance dc is hard to set accurately, so the number of cluster centers cannot be read precisely from a decision graph obtained with a single dc value. To address this, the elbow method is employed to determine the optimal number of clusters for an unknown dataset. As the data are clustered into different numbers of clusters, the cost function value changes accordingly. The number of clusters is plotted on the x-axis and the cost function value on the y-axis, giving a line graph. The optimal number of clusters is taken at the “elbow,” the point beyond which adding cluster centers no longer decreases the cost function value significantly. After the initial cluster centers and the number of clusters k are determined, the dataset is clustered with the K-means algorithm to obtain k clusters and their cluster centers. The objective function value of each data point in each dimension is then computed within each cluster, and the values for each dimension are sorted in ascending order. The sorted values are fitted by the least-squares method to obtain a curve, and the derivative of the fitted curve gives the slope, that is, the rate of change of the objective function values within each cluster. Because the degree of dispersion and the information content can differ between dimensions, each dimension is assigned its own weight. Information entropy is used to measure the degree of dispersion, and dimensions with a higher degree of outlierness receive higher weights to reflect their impact on the overall dataset. Incorporating information entropy, each data point's objective function value in each dimension is weighted by the corresponding rate of change. This yields the final anomaly score, and the top-n data points with the highest anomaly scores are regarded as outliers.
Results and Discussions: On the artificial datasets, KLOD, KNN, and LOF all detect sparse local outliers effectively. However, the LOF algorithm struggles to detect outliers within outlier clusters, and the KNN method cannot detect local outliers within densely distributed clusters when there is a considerable distance between normal data points. In contrast, the KLOD method analyzes each cluster individually, which addresses the issue of uneven cluster densities, and analyzes each dimension of the data points within each cluster separately, which leads to accurate detection. On the UCI datasets, KLOD achieves the best detection accuracy on 10 datasets and accuracy on par with KNN and LOF on 2 datasets. Overall, KLOD demonstrates higher outlier detection accuracy than the KNN and LOF algorithms. To improve the K-means clustering algorithm, the density-peaks method is applied to calculate the local density and relative distance of each data point, and the γ value of each data point is determined from these two metrics. However, the size of γ is influenced by the cutoff distance dc, which makes it difficult to choose the k initial cluster centers intuitively. Hence, the elbow method is used to select the k data points with the largest γ values as the initial cluster centers for the K-means clustering algorithm. Least-squares fitting is applied to the objective function values of each dimension sorted in ascending order. This highlights the degree of outlierness of the outliers and incorporates more outlier information into the final anomaly score.
Conclusions: Experimental results on artificial and real UCI datasets demonstrate that the KLOD method detects local outliers with good accuracy. Compared with the KNN and LOF methods, it significantly improves detection accuracy. However, owing to the limitations of the K-means algorithm itself, its clustering performance is poor on datasets containing arbitrarily shaped clusters, which degrades detection performance. Future studies can therefore focus on improving the performance of outlier detection methods on datasets with arbitrarily shaped clusters.
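The least-squares step of the Methods can be sketched for a single cluster as follows. This is only an illustration under stated assumptions, not the paper's procedure: the per-dimension objective function is assumed to be the squared deviation from the cluster center, the fitting function is assumed to be a degree-2 polynomial fitted against the rank of each sorted value, the entropy-based dimension weights are omitted, and the name `cluster_scores` is hypothetical.

```python
import numpy as np

def cluster_scores(points, center, degree=2):
    """Score each point of one cluster; a larger score means more outlying.

    Assumptions (not taken from the paper): squared deviation from the
    center as the per-dimension objective, polynomial fitting function,
    and no entropy-based weighting of dimensions.
    """
    obj = (points - center) ** 2              # shape (n_points, n_dims)
    n, d = obj.shape
    ranks = np.arange(n, dtype=float)
    scores = np.zeros(n)
    for j in range(d):
        order = np.argsort(obj[:, j])         # ascending order in dimension j
        sorted_vals = obj[order, j]
        # Least-squares polynomial fit of the sorted values against their rank.
        coeffs = np.polyfit(ranks, sorted_vals, degree)
        # Derivative of the fitted curve, evaluated at every rank.
        slope = np.polyval(np.polyder(coeffs), ranks)
        # Weight each objective value by the local rate of change.
        scores[order] += sorted_vals * slope
    return scores
```

The cluster should contain more points than the polynomial degree for the fit to be well determined. Points whose sorted objective values lie on the steep tail of the fitted curve receive the largest scores, which is the effect the abstract attributes to the rate-of-change weighting.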
Authors
周玉
夏浩
岳学震
王培崇
ZHOU Yu; XIA Hao; YUE Xuezhen; WANG Peichong (School of Electrical Eng., North China Univ. of Water Resources and Electric Power, Zhengzhou 450045, China; School of Info. Eng., Hebei Univ. of Geosciences, Shijiazhuang 050031, China)
Source
《工程科学与技术》
EI
CAS
CSCD
Peking University Core Journals (北大核心)
2024, Issue 4, pp. 66-77 (12 pages)
Advanced Engineering Sciences
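Funding
National Natural Science Foundation of China (U1504622, 31671580)
Training Program for Young Backbone Teachers in Higher Education Institutions of Henan Province (2018GGJS079)
Science and Technology Research Project of Higher Education Institutions of Hebei Province (ZD2020344).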