Research on performance of big data computing and query processing based on Impala

Original title: 基于Impala的大数据查询分析计算性能研究 (cited by: 12)


Abstract: This paper analyzes the principles and architecture of Impala, the real-time big data query engine released by Cloudera, and compares the performance and characteristics of Impala with those of traditional MapReduce. To address Impala's weaknesses in complex big data processing, it proposes a method that combines MapReduce with Impala: MapReduce preprocesses the input data for Impala, so that MapReduce's strength in handling complex jobs compensates for Impala's shortcomings in that area. Query and analysis experiments were run on China Telecom's mobile internet (WAP) access logs. The results show that Impala is clearly faster than traditional MapReduce, and that for big data queries the combined MapReduce and Impala method is about twice as fast as traditional MapReduce. In the iterative query experiments the combined method is more than eight times faster than traditional MapReduce. The combined method is therefore still more efficient than traditional MapReduce for a single query and far more efficient for iterative queries. The paper concludes that combining MapReduce with Impala draws on the strengths of both Impala and Hadoop: it is far more efficient than traditional MapReduce and handles complex big data processing better than Impala alone.
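
The two-stage idea described in the abstract can be illustrated with a small sketch. It is plain Python and does not call Hadoop or Impala; the log format, the sample records, and every function name are hypothetical and are not taken from the paper. Stage 1 stands in for the MapReduce preprocessing job, and stage 2 stands in for the interactive queries that would be issued against the cleaned table.

```python
from collections import defaultdict

# Hypothetical raw log lines: "timestamp|phone|url|bytes" (format assumed).
RAW_LOGS = [
    "2014-03-01 08:00:01|13800000001|http://a.example/x|1200",
    "malformed line",
    "2014-03-01 08:00:05|13800000002|http://b.example/y|300",
    "2014-03-01 09:10:00|13800000001|http://a.example/z|500",
]


def map_phase(line):
    """Parse one raw line; emit (phone, bytes) pairs, or nothing if malformed."""
    parts = line.split("|")
    if len(parts) != 4 or not parts[3].isdigit():
        return []
    return [(parts[1], int(parts[3]))]


def reduce_phase(pairs):
    """Sum the byte counts for each phone number."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def preprocess(lines):
    """Stage 1: MapReduce-style cleaning and aggregation, run once."""
    pairs = [kv for line in lines for kv in map_phase(line)]
    return reduce_phase(pairs)


def query_heavy_users(table, min_bytes):
    """Stage 2: stand-in for an interactive query over the cleaned table."""
    return sorted(phone for phone, total in table.items() if total >= min_bytes)


table = preprocess(RAW_LOGS)
# Repeated queries reuse the preprocessed table instead of re-reading raw logs.
for threshold in (200, 1000):
    print(threshold, query_heavy_users(table, threshold))
```

In this sketch the preprocessing cost is paid once and each later query only touches the cleaned table, which is one plausible reading of why the abstract reports the largest gain for iterative queries.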

Authors: Guo Chao (郭超), Liu Bo (刘波), Lin Weiwei (林伟伟)

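Affiliations: School of Computer Science, South China Normal University; School of Computer Science and Engineering, South China University of Technology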

Source: Application Research of Computers (《计算机应用研究》), indexed in CSCD and the Peking University Core Journals list, 2015, No. 5, pp. 1330-1334 (5 pages)

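Funding: National Natural Science Foundation of China (61070015); Natural Science Foundation of Guangdong Province (S2011010001754, S2012030006242); Guangdong Provincial Science and Technology Plan Project (2012B010100030)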

Keywords: big data; Hadoop; MapReduce; Impala; computing performance; query analysis

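Classification: TP391 [Automation and Computer Technology: Computer Application Technology]; TP301.6 [Automation and Computer Technology: Computer System Architecture]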



