
A columnar storage structure based on group sorting of key columns
Abstract  Disks are the main storage medium for massive data because they offer large capacity at low cost, but disk I/O bandwidth lags far behind the rate of data growth and is increasingly the performance bottleneck of big data management systems. Optimizing the storage structure to improve read and write efficiency is therefore a major challenge for such systems. This paper proposes KCGS-Store, a hybrid columnar storage structure based on group sorting of key columns. The relational table is divided into storage pools according to groupings of the key columns, so that all records in a pool share the same value or value range on the key columns; the pools are then merged column by column. After merging, the key columns are ordered pool by pool, so a conditional query can filter out irrelevant column values, which reduces the amount of data read and improves query performance. A pool-number index allows records to be reassembled at a small cost in time and space. Experiments show that, compared with ORCFile and Parquet, KCGS-Store achieves varying degrees of improvement in storage space, data loading, and SQL query performance.
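The pooling idea in the abstract can be illustrated with a small in-memory sketch. This is not the KCGS-Store implementation: the abstract does not describe the on-disk layout or the format of the pool-number index, so the function names (`build_store`, `query`), the `(key, start, end)` index entries, and the use of exact key values rather than value ranges are all assumptions made for illustration.

```python
from collections import defaultdict


def build_store(rows, key_col, columns):
    """Group rows into pools by key-column value, then lay out every
    column pool by pool (hypothetical in-memory stand-in for a file)."""
    pools = defaultdict(list)
    for row in rows:
        pools[row[key_col]].append(row)

    store = {c: [] for c in columns}   # one list per column
    pool_index = []                    # assumed format: (key, start, end)
    offset = 0
    for key in sorted(pools):          # pools are written in key order
        members = pools[key]
        for c in columns:
            store[c].extend(r[c] for r in members)
        pool_index.append((key, offset, offset + len(members)))
        offset += len(members)
    return store, pool_index


def query(store, pool_index, lo, hi, select):
    """Return the selected columns for records whose key lies in [lo, hi].
    Pools outside the range are skipped without touching their values."""
    out = []
    for key, start, end in pool_index:
        if lo <= key <= hi:
            # The shared offset range reassembles records across columns.
            for i in range(start, end):
                out.append(tuple(store[c][i] for c in select))
    return out
```

In this sketch a range predicate on the key column only consults the small index to decide which offset ranges to read, and the same offsets line up the values of the other columns so that records can be reassembled.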
Source  Computer Engineering & Science (《计算机工程与科学》), CSCD, Peking University Core Journal, 2016, No. 8, pp. 1536-1541 (6 pages)
Funding  National Natural Science Foundation of China (61373025, 61303002)
Keywords  Hadoop; columnar storage; group sorting; big data

