Abstract
This paper uses rough set theory to study a key problem in the distributed processing of massive data: how to partition a massive data set. Classical rough set algorithms require the data to be memory-resident, so they cannot handle massive data effectively. To process massive data sets directly, the paper proposes an algorithm called Mass Data Partition for Rough Set on Attribute Reduction (MDPRS-AR), built on the definition of a best partition and the idea of attribute reduction. Simulation experiments show that MDPRS-AR partitions data about 70% more efficiently than traditional algorithms, with little loss of correctness compared with algorithms that process the data set as a whole.
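The abstract does not spell out the steps of MDPRS-AR, so the following is only a minimal sketch of the standard rough set notions it relies on: indiscernibility classes, the dependency degree of a decision attribute on a set of condition attributes, and attribute reduction by backward elimination. The function names, the toy decision table, and the idea of grouping rows by the reduced attribute set are assumptions made for illustration, not the authors' algorithm.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group row indices into indiscernibility classes over `attrs`."""
    classes = defaultdict(list)
    for i, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].append(i)
    return list(classes.values())


def dependency(rows, cond_attrs, decision):
    """Fraction of rows in the positive region, i.e. rows whose
    indiscernibility class has a single decision value."""
    if not rows:
        return 0.0
    positive = 0
    for block in partition(rows, cond_attrs):
        if len({rows[i][decision] for i in block}) == 1:
            positive += len(block)
    return positive / len(rows)


def reduce_attributes(rows, cond_attrs, decision):
    """Backward elimination: drop each attribute whose removal leaves
    the dependency degree unchanged. Yields one reduct-like subset,
    not necessarily a minimal one."""
    full = dependency(rows, cond_attrs, decision)
    kept = list(cond_attrs)
    for a in list(cond_attrs):
        trial = [x for x in kept if x != a]
        if dependency(rows, trial, decision) == full:
            kept = trial
    return kept


# Hypothetical toy decision table: condition attributes a, b, c; decision d.
rows = [
    {"a": 0, "b": 0, "c": 1, "d": "no"},
    {"a": 0, "b": 1, "c": 1, "d": "yes"},
    {"a": 1, "b": 0, "c": 0, "d": "no"},
    {"a": 1, "b": 1, "c": 0, "d": "yes"},
]

reduct = reduce_attributes(rows, ["a", "b", "c"], "d")
blocks = partition(rows, reduct)  # rows grouped by the reduced attributes
```

In this toy table the decision depends only on `b`, so the reduced attribute set is much smaller than the original one. Grouping rows on fewer attributes is one plausible reason attribute reduction could help when splitting a large table into blocks for distributed processing, though the paper's actual partitioning criterion may differ.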
Source
《计算机技术与发展》
2010, No. 4, pp. 5-7, 11 (4 pages)
Computer Technology and Development
Funding
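National Natural Science Foundation of China (60973139, 60773041)
Natural Science Foundation of Jiangsu Province (BK2008451)
National High-Tech 863 Program (2007AA01Z404, 2007AA01Z478)
National Key Laboratory of Modern Communications Fund (9140C1105040805)
National and Jiangsu Province Postdoctoral Funds (0801019C, 20090451240, 20090451241)
Jiangsu University Science and Technology Innovation Program (CX08B-086Z)
Jiangsu Province Six Talent Peaks Project (2008118)
Jiangsu Province Qing Lan Project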
Keywords
mass data
rough set
data partition
distributed processing
attribute reduction