Abstract
Join is the most commonly used operation in data analysis applications over large-scale datasets, but MapReduce by itself cannot handle joins efficiently when the data are skewed. To address this, a MapReduce-based frequency-classified join algorithm is proposed. The dataset is divided into three classes according to how frequently each value occurs in the datasets being joined. Skewed data are redistributed with a partitioning algorithm and a broadcast algorithm to eliminate the effect of skew, while non-skewed data are redistributed with a hash algorithm. After redistribution, the join can be completed within a single node, which avoids the cross-node transfer cost of joins under the MapReduce framework and balances the task load across MapReduce nodes, thereby improving join efficiency on skewed data. Comparison with traditional join algorithms demonstrates the effectiveness and practicality of the proposed algorithm.
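The redistribution idea in the abstract can be sketched in a few lines of Python. This is only an illustrative single-process simulation, not the paper's implementation: the abstract does not say how the three classes are defined, so the sketch assumes they are keys that are frequent in the left input, keys that are frequent in the right input, and all remaining keys. The function names (`redistribute`, `local_join`, `skew_join`), the `threshold` parameter, and the round-robin placement are all hypothetical choices made for the example.

```python
from collections import Counter, defaultdict


def redistribute(left, right, num_nodes, threshold):
    """Assign (key, value) records of both inputs to simulated nodes.

    Assumed three-way classification (not specified in the abstract):
    keys frequent in `left`, keys frequent in `right`, everything else.
    """
    left_freq = Counter(key for key, _ in left)
    right_freq = Counter(key for key, _ in right)
    hot_left = {k for k, c in left_freq.items() if c >= threshold}
    # A key frequent on both sides is treated as hot on the left only.
    hot_right = {k for k, c in right_freq.items() if c >= threshold} - hot_left

    # Each node holds a pair: (left records, right records).
    nodes = [([], []) for _ in range(num_nodes)]

    def place(records, side, partition_keys, broadcast_keys):
        cursor = 0  # round-robin position for the skewed records
        for key, value in records:
            if key in partition_keys:
                # Skewed side: spread the records evenly over all nodes.
                nodes[cursor % num_nodes][side].append((key, value))
                cursor += 1
            elif key in broadcast_keys:
                # Matching records of the other side: copy to every node.
                for node in nodes:
                    node[side].append((key, value))
            else:
                # Non-skewed keys: ordinary hash placement.
                nodes[hash(key) % num_nodes][side].append((key, value))

    place(left, 0, hot_left, hot_right)
    place(right, 1, hot_right, hot_left)
    return nodes


def local_join(node):
    """Join the two record lists held by one node, with no other node involved."""
    left_part, right_part = node
    index = defaultdict(list)
    for key, value in right_part:
        index[key].append(value)
    return [(key, lv, rv) for key, lv in left_part for rv in index.get(key, [])]


def skew_join(left, right, num_nodes=4, threshold=4):
    """Redistribute, then join each node locally and concatenate the results."""
    result = []
    for node in redistribute(left, right, num_nodes, threshold):
        result.extend(local_join(node))
    return result
```

Under these assumptions every matching pair is produced exactly once: a record with a skewed key lives on a single node while its join partners are copied to all nodes, and records with non-skewed keys from both inputs hash to the same node. The cost of the approach is the extra copies made by broadcasting, which stays small as long as the side being broadcast has few records for the skewed keys.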
Source
Computer Science (《计算机科学》), 2016, No. 9, pp. 27-31 (5 pages)
Indexed in: CSCD, Peking University Core Journals (北大核心)
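Funding
Supported by the Key Program of the Natural Science Foundation of Hubei Province (2015CFA067, 2013CFA115), the Scientific Research Program of the Hubei Provincial Department of Education (D20151001), and the Wuhan Key Science and Technology Program (2013012401010851).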