
A Spark/Shark-based OLAP analysis system for big data from electricity consumption information collection (cited by: 5)

Spark/Shark-based OLAP system for smart grid applications
Abstract: OLAP queries over big data from electricity consumption information systems in the smart grid involve huge data volumes, frequent multi-table joins, and complex SQL structure. Faced with this kind of workload, a traditional RDBMS shows poor scalability, low write throughput, and low query efficiency. To address this, a Spark/Shark-based OLAP analysis system for electricity consumption big data was designed. The system stores the data from the electricity consumption information collection system in the distributed file system HDFS, uses Shark as the front end to parse SQL queries, and uses Spark to execute them. However, native Shark supports only coarse-grained partitioning and no fine-grained indexing, so it cannot filter out irrelevant data efficiently, which hurts query performance. To overcome this limitation, the system introduces TrieIndex, a fine-grained index structure based on a Trie tree (prefix tree), and uses data reorganization to optimize how the data is laid out in HDFS. Together these improve Shark's data filtering and the performance of OLAP analysis on electricity consumption big data. Experiments with data and queries from a real electricity consumption information collection system show that the system improves write speed by a factor of 12 over a relational database and improves query efficiency by more than a factor of 10 over native Shark.
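Source: Journal of University of Science and Technology of China (JUSTC); indexed in CAS, CSCD, and the Peking University Core Journals list; 2016, No. 1, pp. 66-75 (10 pages)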
Funding: Supported by a State Grid Corporation of China Science and Technology Project (SGJSXT00YWJS1400072)
Keywords: Spark; OLAP; power big data; index; Trie tree
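
The abstract names TrieIndex but does not describe its internal layout, so the following is only a minimal sketch of the general idea behind a prefix-tree index used to skip irrelevant data. The class name `PrefixIndex`, the key layout (region code, meter ID, date), and the block IDs are all assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)  # next key component -> child node
    blocks: set = field(default_factory=set)      # blocks holding records under this prefix


class PrefixIndex:
    """Toy prefix-tree index (hypothetical): key components -> block IDs to scan."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, key, block_id):
        """Register that `block_id` holds a record with the given key tuple."""
        node = self.root
        node.blocks.add(block_id)
        for part in key:
            node = node.children.setdefault(part, TrieNode())
            node.blocks.add(block_id)

    def lookup(self, prefix):
        """Return the blocks that may contain records matching `prefix`."""
        node = self.root
        for part in prefix:
            node = node.children.get(part)
            if node is None:
                return set()  # no record has this prefix: nothing to scan
        return set(node.blocks)


index = PrefixIndex()
# Hypothetical keys: (region code, meter ID, date); block IDs are made up.
index.insert(("3201", "meter-17", "2015-06-01"), "block-0")
index.insert(("3201", "meter-17", "2015-06-02"), "block-0")
index.insert(("3201", "meter-42", "2015-06-01"), "block-1")
index.insert(("3205", "meter-08", "2015-06-01"), "block-2")

print(sorted(index.lookup(("3201",))))             # ['block-0', 'block-1']
print(sorted(index.lookup(("3201", "meter-17"))))  # ['block-0']
print(index.lookup(("3299",)))                     # set()
```

In this sketch a query restricted to one meter touches a single block instead of all three. The benefit depends on records with a shared prefix being stored together, which is presumably the role that data reorganization plays in the system the abstract describes.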
