Abstract
Irrelevant or redundant information in high-dimensional data sets makes feature extraction computationally expensive, a problem that has become a focus of research. The neighborhood rough set model can improve computational efficiency by deleting redundant information from large-scale data. To further improve the efficiency of existing neighborhood rough set models when extracting features from continuous high-dimensional databases, we propose a feature extraction algorithm based on the positive region and voting attribute importance. First, the algorithm improves the calculation of voting attribute importance, drawing on two observations: the positive region remains unchanged before and after attribute reduction, and attribute reduction is intrinsically linked to the intra-class merging and inter-class differentiation of the decision classes within the positive region. Second, it incorporates an attribute granularity threshold and evaluates the importance of conditional attributes from three aspects (inter-domain, inter-class, and intra-class differentiation), which reduces the distance effect that conditional attributes with different distribution densities have on the voting results. Finally, the importance of all conditional attributes is obtained in a single round of voting, which reduces the importance calculation from k dimensions to one dimension and thereby lowers the computational complexity. Experimental analysis shows that the proposed algorithm markedly improves the efficiency of attribute importance calculation and compares favorably with existing algorithms in classification accuracy and running time on seven UCI test data sets.
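For readers unfamiliar with the positive region that the abstract builds on, the following is a minimal sketch of the standard neighborhood rough set definition: a sample belongs to the positive region when every sample within distance delta of it (on the chosen attributes) carries the same decision label. It is not the authors' voting-based importance measure, whose details the abstract does not give. The function names, the Euclidean distance, the threshold `delta`, and the toy data are all assumptions made for illustration.

```python
from math import dist


def neighborhood(data, i, attrs, delta):
    """Indices of samples within distance delta of sample i on the chosen attributes."""
    xi = [data[i][a] for a in attrs]
    return [j for j, row in enumerate(data)
            if dist(xi, [row[a] for a in attrs]) <= delta]


def positive_region(data, labels, attrs, delta):
    """Samples whose whole neighborhood shares their own decision label."""
    return [i for i in range(len(data))
            if all(labels[j] == labels[i]
                   for j in neighborhood(data, i, attrs, delta))]


def dependency(data, labels, attrs, delta):
    """Fraction of samples in the positive region (0.0 to 1.0)."""
    return len(positive_region(data, labels, attrs, delta)) / len(data)


# Hypothetical toy data: attribute 0 separates the classes, attribute 1 does not.
data = [[0.10, 0.90], [0.15, 0.20], [0.80, 0.85], [0.85, 0.15]]
labels = ["a", "a", "b", "b"]

for attrs in ([0], [1], [0, 1]):
    print(attrs, dependency(data, labels, attrs, delta=0.2))
```

In this toy case, attribute 0 alone already keeps every sample in the positive region, while attribute 1 alone leaves it empty, which is the kind of contrast an attribute importance measure is meant to capture.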
Authors
骆公志
张尚蕾
LUO Gongzhi; ZHANG Shanglei (School of Management, Nanjing University of Posts and Telecommunications, Nanjing 210003, China)
Source
《南京邮电大学学报(自然科学版)》
Indexed in the Peking University Core Journals list (北大核心)
2024, No. 1, pp. 79-89 (11 pages)
Journal of Nanjing University of Posts and Telecommunications: Natural Science Edition
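Funding
National Natural Science Foundation of China (72171124)
Major Project of Philosophy and Social Science Research in Jiangsu Universities (2021SJZDA129)
Jiangsu Province Graduate Research Innovation Program (KYCX22-0884)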
Keywords
neighborhood rough set
attribute importance
positive region
voting strategy
feature extraction