A columnar storage structure based on group sorting of key columns


Abstract: As the main storage medium for massive data, disks offer large capacity and low cost. However, disk I/O bandwidth lags far behind the growth rate of data and is increasingly the performance bottleneck of big data management systems. Optimizing the storage structure to improve read and write efficiency is therefore an important challenge for such systems. This paper presents KCGS-Store, a hybrid columnar storage structure based on group sorting of key columns. The relational table is divided into storage pools according to key-column groups, so that all records in a pool share the same value or value range on the key columns; the pools are then merged column by column. After merging, the key columns are ordered pool by pool, so a conditional query can filter out irrelevant column values, which reduces the amount of data read and improves query performance. A pool-number index is used to reassemble records at a small cost in time and space. Experimental results show that, compared with ORCFile and Parquet, KCGS-Store achieves improvements of varying degree in storage space, data loading, and SQL query performance.
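The abstract does not specify the on-disk layout, so the following is only a minimal in-memory sketch of the idea it describes: records are grouped into pools by key-column value, each column is stored pool by pool, and a pool index is used both to skip irrelevant pools and to reassemble records. The names `build_store`, `select`, and `pool_index` are hypothetical, and the sketch assumes one pool per distinct key value rather than per value range.

```python
from collections import defaultdict


def build_store(rows, key):
    """Group rows into pools by key-column value, then lay out every column pool by pool."""
    pools = defaultdict(list)
    for row in rows:
        pools[row[key]].append(row)

    columns = {name: [] for name in rows[0]}
    pool_index = []  # (key value, start offset, end offset), ordered by key value
    for value in sorted(pools):
        start = len(columns[key])
        for row in pools[value]:
            for name in columns:
                columns[name].append(row[name])
        pool_index.append((value, start, len(columns[key])))
    return columns, pool_index


def select(columns, pool_index, lo, hi, wanted):
    """Return the wanted columns for rows whose key lies in [lo, hi].

    Only the offset ranges of matching pools are read; the shared offsets
    are what allow values from different columns to be rejoined into records.
    """
    out = []
    for value, start, end in pool_index:
        if lo <= value <= hi:
            for i in range(start, end):
                out.append(tuple(columns[name][i] for name in wanted))
    return out


rows = [
    {"region": "east", "id": 1, "amount": 10},
    {"region": "west", "id": 2, "amount": 20},
    {"region": "east", "id": 3, "amount": 30},
    {"region": "north", "id": 4, "amount": 40},
]
columns, pool_index = build_store(rows, "region")
east_rows = select(columns, pool_index, "east", "east", ["id", "amount"])
```

In this sketch a query on `region` consults only `pool_index` to decide which offset ranges to read, so the non-matching pools of every other column are never touched. A real implementation would presumably also compress each pool and persist the index, neither of which is shown here.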

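Authors: XU Tao (徐涛), GU Yu (顾瑜), WANG Dongsheng (汪东升)

Affiliation: Department of Computer Science and Technology, Tsinghua University

Source: Computer Engineering & Science (《计算机工程与科学》), CSCD, Peking University Core Journal, 2016, No. 8, pp. 1536-1541 (6 pages)

Funding: National Natural Science Foundation of China (61373025, 61303002)

Keywords: Hadoop; columnar storage; group sorting; big data

Classification: TP333 [Automation and Computer Technology: Computer System Architecture]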


References (11)

[1] Jacobs A. The pathologies of big data[J]. Communications of the ACM, 2009, 52(8): 36-44.
[2] Ghemawat S, Gobioff H, Leung S T. The Google file system[J]. ACM SIGOPS Operating Systems Review, 2003, 37(5): 29-43.
[3] Shvachko K, Kuang H, Radia S, et al. The Hadoop distributed file system[C]// Proc of IEEE Conference on Mass Storage Systems and Technologies, 2010: 1-10.
[4] Floratou A, Minhas U F, Ozcan F. SQL-on-Hadoop: Full circle back to shared-nothing database architectures[J]. Proceedings of the VLDB Endowment, 2014, 7(12): 1295-1306.
[5] Batory D S. On searching transposed files[J]. ACM Transactions on Database Systems, 1979, 4(4): 531-544.
[6] Copeland G P, Khoshafian S N. A decomposition storage model[J]. ACM SIGMOD Record, 1985, 14(4): 268-279.
[7] He Y, Lee R, Huai Y, et al. RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems[C]// Proc of the 2011 IEEE 27th International Conference on Data Engineering, 2011: 1199-1208.
[8] Hortonworks Inc. ORC Files[EB/OL]. [2016-02-09]. https://issues.apache.org/jira/secure/attachment/12564124/OrcFileIntro.pptx.
[9] Trevni[EB/OL]. [2016-02-09]. https://github.com/cutting/trevni.
[10] Cloudera Enterprise. Parquet[EB/OL]. [2016-02-09]. https://github.com/Parquet.

