摘要
用电信息大数据上的OLAP查询涉及数据量大,具有多表连接操作频繁、SQL结构复杂等特点,传统关系型数据库面对该类应用,表现出可扩展性弱、数据写入吞吐量低与查询效率低等问题.为此设计了一套基于Spark/Shark的电力大数据OLAP分析系统,该系统采用分布式文件系统HDFS保存电力用电信息采集系统的大数据,通过Shark进行前端SQL解析,Spark进行查询计算;然而,原生Shark只支持粗粒度分区,不支持细粒度的索引技术,难以高效地过滤无关数据,影响了查询性能.为克服这一不足,该系统设计了一种基于前缀树的细粒度索引结构TrieIndex,并通过数据重组技术优化了数据在HDFS的分布,提升了Shark的数据过滤能力以及用电信息大数据OLAP分析的性能.真实用电信息采集系统数据与查询的实验结果表明,该系统比关系型数据库的写入速度提升了12倍,比原生Shark的查询效率提升了10倍以上.
The OLAP queries on electricity consumption information in Smart Grid have some prominent features: huge amounts of data, involving multiple tables in a joint operation, complex SQL structure, etc. Faced with this kind of applications, traditional RDBMS always leads to poor scalability, low write throughput, and unacceptable query performance, etc. A Spark/Shark-Based OLAP system for electricity consumption information in smart grid was designed. The system used distributed file system HDFS for data storage, and makes use of Shark to parse the SQL queries and Spark to execute them. However, fine-grained index, which hmclers turmer unioltovc~ "t ~ t----- Shark does not support Trie tree based fine-grained index technique TrieIndex and data re-organization overcome this limitation, a ts with real electrmlty scheme for better query performance was proposed. The experiment resul consumption information data and query show that the write throughput of the system is 12 times faster than that of RDBMS, and the query efficiency of the system is 10 times greater than that of original Shark.
基金
国家电网公司科技项目(SGJSXT00YWJS1400072)资助