
Massive Data Parallel Random Sampling Based on Hadoop (cited by: 11)
Abstract: In today's era of "information explosion", massive data poses new challenges for data mining. As data mining moves to cloud computing platforms for parallelization, parallel random sampling is studied as a way to further reduce the volume of data that must be processed. This paper proposes a MapReduce parallel sampling algorithm that, in a single scan of the data, both cleans dirty data and performs equal-probability sampling. The algorithm is implemented on the Hadoop platform and compared with an ordinary random sampling method; the results show that it is highly time-efficient and is an effective approach. The work lays a foundation for future research on sampling in data mining and for advancing data mining on massive data.
Authors: Wan Wan (宛婉), Zhou Guoxiang (周国祥)
Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, 2014, No. 20, pp. 115-118 (4 pages)
Keywords: cloud computing; Hadoop; MapReduce; parallel computing; data mining; random sampling
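The abstract describes a single-scan MapReduce job that discards dirty records and draws an equal-probability sample, but this record does not give the algorithm itself. The following is only a minimal sketch of one common way to get that behaviour, not necessarily the authors' method: each mapper tags every clean record with a uniform random key and keeps its k smallest keys, and a single reducer keeps the k smallest keys overall. The names `is_clean`, `map_sample`, and `reduce_sample`, the comma-separated three-field record format, and the cleaning rule are all assumptions made for illustration. The code simulates the two phases in plain Python rather than calling any Hadoop API.

```python
import heapq
import random
from typing import Callable, Iterable, List, Tuple

Keyed = Tuple[float, str]  # (random key, record)


def is_clean(record: str, n_fields: int = 3) -> bool:
    """Hypothetical cleaning rule: exact field count and no empty field."""
    fields = record.split(",")
    return len(fields) == n_fields and all(f.strip() for f in fields)


def map_sample(split: Iterable[str], k: int, rng: random.Random,
               clean: Callable[[str], bool] = is_clean) -> List[Keyed]:
    """Map phase: scan one input split once.

    Dirty records are dropped; each clean record gets an independent
    uniform key, and only the k smallest keys of this split are emitted.
    """
    keyed = ((rng.random(), rec) for rec in split if clean(rec))
    return heapq.nsmallest(k, keyed)


def reduce_sample(partials: Iterable[List[Keyed]], k: int) -> List[str]:
    """Reduce phase: merge all mapper outputs and keep the k smallest keys.

    Because keys are i.i.d. uniform, the k records with the smallest keys
    form a uniform random sample (without replacement) of the clean records.
    """
    merged = heapq.nsmallest(k, (pair for part in partials for pair in part))
    return [rec for _, rec in merged]


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed only to make the demo repeatable
    splits = [
        [f"{i},alpha,beta" for i in range(1000)] + ["broken,row"],
        [f"{i},gamma,delta" for i in range(1000, 1500)] + ["", "1,,3"],
    ]
    partials = [map_sample(s, 10, rng) for s in splits]
    for rec in reduce_sample(partials, 10):
        print(rec)
```

Keeping only k candidates per mapper is safe because any record in the global k smallest keys must also be among the k smallest of its own split, so the data volume sent to the reducer is bounded by k times the number of mappers regardless of input size.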