Abstract
Efficient discretization algorithms are needed to handle decision tables with large numbers of objects. Based on the distribution of the importance values of candidate cuts on a single attribute, a "cluster dynamically first, then select candidate cuts" approach and a fast discretization algorithm based on rough set theory are proposed. First, the candidate cuts are quickly and dynamically clustered according to the distribution of cut importance on a single attribute, which effectively reduces the number of candidate cuts. Second, a heuristic method quickly selects the final cut set from the clustering result, and the decision table is discretized with these cuts. Experimental results show that dynamic clustering reduces the number of candidate cuts by more than 80% on most data sets, which greatly improves the efficiency of the subsequent cut selection. On seven UCI data sets (Iris, Wine, Glass, Ecoli, Breast_w, Pima, and Letter), the proposed algorithm achieves recognition rates of about 92.0%, 92.1%, 69.3%, 65.7%, 95.3%, 67.1%, and 76.5%, respectively.
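The two-step idea in the abstract can be illustrated with a minimal Python sketch for a single numeric attribute. The abstract does not give the paper's formulas, so everything below is an assumption made for illustration: cut importance is taken to be the number of object pairs with different decisions that a cut separates, clustering merges adjacent cuts whose importance differs by at most a hypothetical tolerance `tol`, and the selection heuristic is a plain greedy cover. The function names are likewise hypothetical and are not the authors' implementation.

```python
from itertools import combinations


def candidate_cuts(values):
    """Midpoints between adjacent distinct values of one attribute."""
    vs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(vs, vs[1:])]


def discerned_pairs(values, labels, cut):
    """Object pairs with different decisions lying on opposite sides of `cut`."""
    return {
        (i, j)
        for i, j in combinations(range(len(values)), 2)
        if labels[i] != labels[j] and (values[i] < cut) != (values[j] < cut)
    }


def cluster_cuts(cuts, importance, tol):
    """Assumed clustering rule: adjacent cuts join the same cluster while their
    importance changes by at most `tol`; keep the most important cut per cluster."""
    if not cuts:
        return []
    clusters = [[0]]
    for k in range(1, len(cuts)):
        if abs(importance[k] - importance[k - 1]) <= tol:
            clusters[-1].append(k)
        else:
            clusters.append([k])
    return [cuts[max(c, key=lambda k: importance[k])] for c in clusters]


def select_cuts(values, labels, cuts):
    """Assumed heuristic: greedily pick the cut that separates the most
    still-unseparated pairs until the given cuts can separate nothing more."""
    pairs = {c: discerned_pairs(values, labels, c) for c in cuts}
    remaining = set().union(*pairs.values())
    chosen = []
    while remaining:
        best = max(cuts, key=lambda c: len(pairs[c] & remaining))
        chosen.append(best)
        remaining -= pairs[best]
    return sorted(chosen)


# Toy single-attribute decision table (made-up data, not from the paper).
values = [1, 2, 3, 4, 5, 6, 7, 8]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

cuts = candidate_cuts(values)
importance = [len(discerned_pairs(values, labels, c)) for c in cuts]
reduced = cluster_cuts(cuts, importance, tol=4)
final_cuts = select_cuts(values, labels, reduced)
```

On this toy table the seven candidate cuts collapse to a single representative at 4.5, which is also the only cut the greedy step selects. Real decision tables have several attributes, and the paper's clustering criterion may differ from the tolerance rule assumed here.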
Source
Journal of Southwest Jiaotong University (《西南交通大学学报》)
EI
CSCD
PKU Core (北大核心)
2010, No. 6, pp. 977-983 (7 pages)
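Funding
National Natural Science Foundation of China (60573068, 60773113)
Key Project of the Natural Science Foundation of Chongqing (2008BA2017)
Chongqing Science Fund for Distinguished Young Scholars (2008BA2041)
Science and Technology Research Project of the Chongqing Municipal Education Commission (KJ090512)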
Keywords
rough set
decision table
discretization
clustering