Pragma Directed Shared Memory Centric Optimizations on GPUs (Cited by: 1)


Abstract: GPUs have become a ubiquitous choice of coprocessor because of their excellent capacity for concurrent processing. In GPU architectures, shared memory plays a very important role in system performance, since it can greatly improve bandwidth utilization and accelerate memory operations. However, even for affine GPU applications with regular access patterns, optimizing for shared memory is not easy: it often requires programmer expertise and nontrivial parameter selection, and improper shared memory usage may even leave GPU resources underutilized. Even with state-of-the-art high-level programming models (e.g., OpenACC and OpenHMPP), shared memory remains hard to exploit, because these models lack inherent support for describing shared memory optimizations and selecting suitable parameters, let alone maintaining high resource utilization. Targeting higher productivity for affine applications, we propose a data-centric approach to shared memory optimization on GPUs. We design a pragma extension to OpenACC that conveys programmers' data management hints to the compiler. We also devise a compiler framework that uses the polyhedral model to automatically select optimal parameters for shared arrays. We further propose optimization techniques to expose higher memory-level and instruction-level parallelism. Experimental results show that our shared-memory-centric approaches improve the performance of five typical GPU applications across four widely used platforms by 3.7x on average, without burdening programmers with many pragmas.
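The abstract's point that parameter selection is nontrivial, and that a poor choice can underutilize the GPU, can be illustrated with a small occupancy calculation. The sketch below is not the paper's compiler framework or its polyhedral analysis. It is a simplified model in which every hardware limit (shared memory per multiprocessor, resident threads, resident blocks) and the helper names `SmLimits`, `resident_blocks`, and `occupancy` are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmLimits:
    """Assumed per-multiprocessor limits (illustrative round numbers only)."""
    shared_bytes: int = 48 * 1024  # assumed shared memory capacity
    max_threads: int = 2048        # assumed resident-thread limit
    max_blocks: int = 16           # assumed resident-block limit


def resident_blocks(tile_rows: int, tile_cols: int, threads_per_block: int = 256,
                    elem_bytes: int = 4, limits: SmLimits = SmLimits()) -> int:
    """Thread blocks that fit on one multiprocessor when each block stages a
    tile_rows x tile_cols array of elem_bytes-sized elements in shared memory."""
    shared_per_block = tile_rows * tile_cols * elem_bytes
    by_shared = limits.shared_bytes // shared_per_block
    by_threads = limits.max_threads // threads_per_block
    # The tightest of the three limits decides how many blocks are resident.
    return min(by_shared, by_threads, limits.max_blocks)


def occupancy(tile_rows: int, tile_cols: int, threads_per_block: int = 256,
              elem_bytes: int = 4, limits: SmLimits = SmLimits()) -> float:
    """Fraction of the resident-thread limit that is actually occupied."""
    blocks = resident_blocks(tile_rows, tile_cols, threads_per_block,
                             elem_bytes, limits)
    return blocks * threads_per_block / limits.max_threads


if __name__ == "__main__":
    # Larger tiles allow more data reuse per block but leave fewer blocks resident.
    for rows, cols in [(32, 32), (64, 64), (128, 64), (128, 128)]:
        print(rows, cols, resident_blocks(rows, cols), occupancy(rows, cols))
```

Under these assumed limits, growing the shared tile from 32 x 32 to 64 x 64 floats cuts the number of resident blocks from 8 to 3, and a 128 x 128 tile does not fit at all. This is the kind of trade-off between data reuse and resource utilization that manual tuning has to navigate.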

Authors: Jing Li (CCF), Lei Liu, Yuan Wu, Xiang-Hua Liu, Yi Gao, Xiao-Bing Feng, Cheng-Yong Wu

Affiliations: State Key Laboratory of Computer Architecture; University of Chinese Academy of Sciences; Beijing Samsung Telecom Research and Development Center

Source: Journal of Computer Science & Technology (indexed in SCIE, EI, CSCD), 2016, No. 2, pp. 235-252 (18 pages). Chinese journal title: 计算机科学技术学报 (English edition)

Funding: This work was supported by the National High Technology Research and Development 863 Program of China under Grant No. 2012AA010902, the National Natural Science Foundation of China (NSFC) under Grant No. 61432018, and the Innovation Research Group of NSFC under Grant No. 61221062.

Keywords: GPU, shared memory, pragma directed, data centric

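Classification: TP316 [Automation and Computer Technology: Computer Software and Theory]; TP334.7 [Automation and Computer Technology: Computer System Architecture]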
