Optimizing Linpack Benchmark on GPU-Accelerated Petascale Supercomputer (cited by: 3)

Abstract: In this paper we present the programming of the Linpack benchmark on the TianHe-1 system, the first petascale supercomputer system of China and the largest GPU-accelerated heterogeneous system attempted to date. A hybrid programming model consisting of MPI, OpenMP, and streaming computing is described to exploit the task, thread, and data parallelism of Linpack. We explain how we optimized the load distribution across the CPUs and GPUs using a two-level adaptive method and describe the implementation in detail. To overcome the low bandwidth of CPU-GPU communication, we present a software pipelining technique that hides the communication overhead. Combined with other traditional optimizations, the Linpack we developed achieved 196.7 GFLOPS on a single compute element of TianHe-1. This result is 70.1% of the peak compute capability and 3.3 times faster than the result obtained with the vendor's library. On the full configuration of TianHe-1, our optimizations resulted in a Linpack performance of 0.563 PFLOPS, which made TianHe-1 the 5th fastest supercomputer on the Top500 list in November 2009.
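The abstract does not spell out the two-level adaptive method, so the following is only a minimal sketch of the general idea of adaptive CPU/GPU load splitting, not the authors' implementation. It assumes a single-level proportional rule: after each step, the share of work given to the GPU is reset to the GPU's measured share of total throughput. The function names, the starting fraction, and the device speeds are all hypothetical.

```python
def update_gpu_fraction(gpu_fraction, gpu_time, cpu_time):
    """Return a new GPU work fraction from one step's measured timings.

    Hypothetical rule (an assumption, not taken from the paper): estimate each
    device's throughput as work share divided by elapsed time, then give the
    GPU its proportional share so both devices would finish together.
    """
    if not 0.0 < gpu_fraction < 1.0:
        raise ValueError("gpu_fraction must be strictly between 0 and 1")
    if gpu_time <= 0.0 or cpu_time <= 0.0:
        raise ValueError("measured times must be positive")
    gpu_rate = gpu_fraction / gpu_time
    cpu_rate = (1.0 - gpu_fraction) / cpu_time
    return gpu_rate / (gpu_rate + cpu_rate)


def simulate(gpu_speed, cpu_speed, steps=5, gpu_fraction=0.5):
    """Simulate repeated updates for devices with fixed, made-up speeds."""
    history = [gpu_fraction]
    for _ in range(steps):
        # Time each device would need for its current share of one unit of work.
        gpu_time = gpu_fraction / gpu_speed
        cpu_time = (1.0 - gpu_fraction) / cpu_speed
        gpu_fraction = update_gpu_fraction(gpu_fraction, gpu_time, cpu_time)
        history.append(gpu_fraction)
    return history


if __name__ == "__main__":
    # Illustrative speeds only: a GPU three times faster than the CPU cores.
    print(simulate(gpu_speed=3.0, cpu_speed=1.0))
```

With fixed speeds the split settles after a single update. On real hardware the measured rates vary with the matrix block size, which is one reason to re-measure and adapt during the run.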

Authors: Wang Feng (王锋), Yang Canqun (杨灿群), Du Yunfei (杜云飞), Chen Juan (陈娟), Yi Huizhan (易会战), Xu Weixia (徐炜遐)

Affiliation: School of Computer Science

Source: Journal of Computer Science & Technology (计算机科学技术学报, English edition), indexed in SCIE, EI, and CSCD; 2011, No. 5, pp. 854-865 (12 pages)

Funding: Supported by the National High Technology Research and Development 863 Program of China under Grant No. 2009AA01A128, the Major Science and Technology Project of China under Grant No. 2009ZX01036-001-003-001, and the National Natural Science Foundation of China under Grant Nos. 61003087, 60903044, 60903059, 60970033, and 60673150.

Keywords: petascale, Linpack, GPU, heterogeneous, supercomputer

Classification No.: TP338 [Automation and Computer Technology: Computer System Architecture]
