
Join Algorithm in Skewed Datasets Based on MapReduce (cited by: 7)
Abstract: Join is one of the most frequently used operations in data analysis applications over large-scale datasets, but MapReduce by itself cannot handle joins efficiently when the data are skewed. To address this, a MapReduce-based frequency classification join algorithm is proposed. The data are divided into three classes according to how frequently they occur in the datasets being joined. Skewed data are redistributed using a partitioning algorithm and a broadcast algorithm, which eliminates the effect of the skew; non-skewed data are redistributed using a hash algorithm. After redistribution, each join can be completed within a single node, which avoids the cost of cross-node data transfer for joins under the MapReduce framework and balances the task load across MapReduce nodes, thereby improving join efficiency on skewed data. A comparison with traditional join algorithms demonstrates the effectiveness and practicality of the proposed algorithm.
Source: Computer Science (《计算机科学》; indexed in CSCD and the Peking University Core Journals list), 2016, No. 9, pp. 27-31 (5 pages)
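Funding: Supported by the Key Program of the Natural Science Foundation of Hubei Province (2015CFA067, 2013CFA115), the Scientific Research Program of the Hubei Provincial Department of Education (D20151001), and the Wuhan Science and Technology Research Program (2013012401010851).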
Keywords: data skew; MapReduce; join algorithm; load balancing
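
The redistribution scheme summarized in the abstract can be illustrated with a small single-process sketch. The abstract does not say what the three classes are or how the partitioning and broadcast steps are combined, so everything specific here is an assumption: the classes are taken to be "key skewed in R", "key skewed in S", and "not skewed"; the skewed side is spread round-robin across nodes while matching tuples from the other side are broadcast to every node; and the names `classify`, `redistribute`, `local_join`, and the `threshold` parameter are hypothetical. This is not the authors' implementation and it does not use the Hadoop API.

```python
from collections import Counter, defaultdict


def classify(key, freq_r, freq_s, threshold):
    """Assumed three-way split by key frequency (not taken from the paper)."""
    if freq_r[key] > threshold:
        return "skew_r"
    if freq_s[key] > threshold:
        return "skew_s"
    return "normal"


def redistribute(R, S, num_nodes, threshold):
    """Assign (key, value) tuples of R and S to simulated nodes."""
    freq_r = Counter(k for k, _ in R)
    freq_s = Counter(k for k, _ in S)
    nodes = [{"R": [], "S": []} for _ in range(num_nodes)]

    for side, tuples, split_class in (("R", R, "skew_r"), ("S", S, "skew_s")):
        cursor = 0  # round-robin position for this side's skewed tuples
        for key, value in tuples:
            cls = classify(key, freq_r, freq_s, threshold)
            if cls == "normal":
                # Non-skewed key: both sides hash to the same node.
                targets = [hash(key) % num_nodes]
            elif cls == split_class:
                # This side is the skewed one: partition it evenly.
                targets = [cursor % num_nodes]
                cursor += 1
            else:
                # The other side is skewed: broadcast to every node.
                targets = range(num_nodes)
            for n in targets:
                nodes[n][side].append((key, value))
    return nodes


def local_join(node):
    """Equi-join the R and S tuples held by one node; no cross-node lookups."""
    index = defaultdict(list)
    for key, value in node["S"]:
        index[key].append(value)
    return [(k, rv, sv) for k, rv in node["R"] for sv in index.get(k, ())]


if __name__ == "__main__":
    R = [(1, f"r{i}") for i in range(6)] + [(2, "r6"), (3, "r7")]
    S = [(1, "s0"), (1, "s1"), (2, "s2")] + [(3, f"t{i}") for i in range(5)]
    nodes = redistribute(R, S, num_nodes=3, threshold=3)
    for i, node in enumerate(nodes):
        print(i, len(node["R"]), len(node["S"]), len(local_join(node)))
```

Under these assumptions each matching pair is produced on exactly one node: a skewed tuple lives on a single node and meets a broadcast copy of every partner there, while non-skewed keys meet on their hash node. The price of the broadcast is extra copies of the non-skewed side's tuples for hot keys, which is cheap only when that side has few tuples per hot key.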

