摘要
由于局部离群点被密度相似的正常点掩盖,不易被隔离,使得扩展的隔离森林算法(EIF)对这类离群点的识别效果不理想。针对此问题,提出基于相对比重的扩展隔离森林算法(Relative Proportion-Extended Isolation Forest,RP-EIF)。该算法仍然基于随机斜度和随机截距划分超平面,生成隔离森林,但根据预测样本落入的叶子节点与其父节点的相对比重计算离群分数排名,而不使用基于路径长度的排名。把全局排名替换为考虑邻域数据分布局部排名增强了算法对局部离群点的敏感性,同时还减少了算法的时间复杂度。在离群点检测数据库(ODDS)的5个公开数据集上验证RP-EIF算法的有效性和算法效率,并与EIF算法、GIF算法、iForest算法、COPOD算法、LOF算法进行了对比。实验表明:RP-EIF算法在5个ODDS公开数据集上的准确率高出EIF算法1至4百分点,高出其他5个算法2至38百分点。而且在5个数据集上的时间消耗RP-EIF算法要比EIF算法减少约30%。
Since local outliers are covered by normal points with similar density,they are not easy to be isolated,so the extended isolation algorithm(EIF)is not effective in identifying such outliers.To solve this problem,an Relative Proportion-Extended Isolation Forest algorithm is proposed.The algorithm still divides the hyperplane based on random slopes and random intercepts to generate isolation forests,but ranks the outlier score based on the relative proportions of the leaf nodes that the predicted samples are fallen into with their parent nodes,rather than the path length-based ranking.Replacing the global ranking with local ranking considering the neighborhood data distribution enhances the algorithm’s sensitivity to local outliers and reduces the algorithm’s time complexity.The effectiveness and algorithm efficiency of the RP-EIF algorithm are tried on 5 public datasets in the Outlier Detection Databases(ODDS).Compared with EIF algorithm,GIF algorithm,iForest algorithm,COPOD algorithm,LOF algorithm,the accuracy of the RP-EIF algorithm on the 5 ODDS public datasets is 1 to 4 percentage points higher than the EIF algorithm,and 2 to 38 percentage points higher than the other 5 algorithms.Moreover,the time consumption of the RP-EIF algorithm on the 5 datasets is about 30%less than that of the EIF algorithm.
作者
刘俊成
董东
LIU Jun-cheng;DONG Dong(School of Computer and Cyber Security,Hebei Normal University,Shijiazhuang 050024,China)
出处
《计算机技术与发展》
2023年第6期16-21,共6页
Computer Technology and Development
基金
教育部教育考试院“十四五”规划支撑专项课题(NEEA2021064)。
关键词
大数据挖掘
离群点检测
局部离群点
扩展的隔离森林算法
相对比重
big data mining
outlier detection
local outliers
extended isolated forest algorithm
relative proportion