Abstract
Classical rough-set-based classification algorithms, such as value reduction, are not well suited to large data sets. To address this, the paper proposes a decision tree algorithm based on rough sets. The algorithm uses a new measure, the attribute classification rough degree, as the heuristic for selecting the attribute at a tree node. This measure captures an attribute's overall contribution to classification more comprehensively than rough set measures of attribute relevance such as the positive region, and it is simpler to compute than information gain or information gain ratio. The algorithm also adopts a new pruning method, predictive pruning (pre-pruning): before the attribute selection computation, it uses variable precision positive regions to revise the initial partition that each attribute induces on the data at a tree node, which more effectively removes the influence of noisy data on attribute selection and leaf node generation. A simple and effective method for detecting and handling inconsistent data is tightly integrated into the tree construction, so the algorithm handles both consistent and inconsistent data effectively. Mining results on 6 data sets from the UCI machine learning repository show that the trees generated by the algorithm are smaller than those generated by ID3 and comparable in size to those generated by a decision tree algorithm that uses information gain ratio as its heuristic. The algorithm generates decision trees or classification rules in which every leaf node satisfies the given minimum confidence and support, and it is easy to implement with database technology, which makes it suitable for large data sets.
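The abstract does not define the attribute classification rough degree, so the sketch below does not attempt to reproduce it. It only illustrates the standard rough set notion the pruning step builds on: the positive region of a decision with respect to an attribute, and its variable precision relaxation, in which a block of the partition counts as positive when its majority decision class reaches a precision threshold. The function names, the `beta` parameter, and the toy data are assumptions made for illustration and do not come from the paper.

```python
from collections import Counter, defaultdict


def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())


def vp_positive_region(rows, attrs, decision, beta=1.0):
    """Return indices of rows lying in blocks whose majority decision
    class covers at least a fraction beta of the block.

    beta = 1.0 gives the classical positive region (pure blocks only);
    beta < 1.0 tolerates a bounded share of noisy rows per block.
    """
    positive = set()
    for block in partition(rows, attrs):
        counts = Counter(rows[i][decision] for i in block)
        majority = counts.most_common(1)[0][1]
        if majority / len(block) >= beta:
            positive.update(block)
    return positive


# Hypothetical toy data: row 3 is a noisy record inside the "sunny" block.
rows = [
    {"outlook": "sunny", "play": "yes"},
    {"outlook": "sunny", "play": "yes"},
    {"outlook": "sunny", "play": "yes"},
    {"outlook": "sunny", "play": "no"},
    {"outlook": "rain", "play": "no"},
    {"outlook": "rain", "play": "no"},
]

classical = vp_positive_region(rows, ["outlook"], "play", beta=1.0)
relaxed = vp_positive_region(rows, ["outlook"], "play", beta=0.75)
print(sorted(classical))
print(sorted(relaxed))
```

With the strict threshold, only the pure "rain" block is positive; lowering `beta` to 0.75 also admits the "sunny" block, whose single noisy row no longer prevents it from being treated as decided. This is the kind of noise tolerance that revising a partition with variable precision positive regions is meant to provide, though the paper's exact revision rule is not given in the abstract.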
Source
《天津大学学报(自然科学与工程技术版)》
EI
CAS
CSCD
Peking University Core Journal
2005, No. 9, pp. 842-846 (5 pages)
Journal of Tianjin University: Science and Technology
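Funding
Science and Technology Development Fund for Universities of the Tianjin Municipal Education Commission (020714); Science and Technology Development Fund research project of Tianjin University of Technology (LG030291)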
Keywords
Rough set
decision tree
attribute classification rough degree
predictive pruning
inconsistent data