
Optimizing Linpack Benchmark on GPU-Accelerated Petascale Supercomputer (cited by: 3)

Abstract: In this paper we present the programming of the Linpack benchmark on the TianHe-1 system, the first petascale supercomputer system of China and the largest GPU-accelerated heterogeneous system attempted to date. A hybrid programming model consisting of MPI, OpenMP and streaming computing is described to exploit the task, thread and data parallelism of Linpack. We explain how we optimized the load distribution across the CPUs and GPUs using a two-level adaptive method and describe the implementation in detail. To overcome the low bandwidth of CPU-GPU communication, we present a software pipelining technique that hides the communication overhead. Combined with other traditional optimizations, the Linpack we developed achieved 196.7 GFLOPS on a single compute element of TianHe-1. This result is 70.1% of the peak compute capability and 3.3 times faster than the result obtained with the vendor's library. On the full configuration of TianHe-1, our optimizations resulted in a Linpack performance of 0.563 PFLOPS, which made TianHe-1 the 5th fastest supercomputer on the Top500 list in November 2009.
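The abstract names an adaptive CPU/GPU load distribution but does not give its details here. As a rough illustration of the general idea only, the sketch below rebalances a column split in proportion to measured device throughput. It is a generic one-level scheme, not the authors' two-level method, and the function names, the column-based work unit, and the device rates are all hypothetical.

```python
# Illustrative sketch only: a generic throughput-proportional split,
# not the two-level adaptive method from the paper.

def update_gpu_fraction(work_gpu, time_gpu, work_cpu, time_cpu):
    """Estimate the share of the next workload to assign to the GPU.

    Each device's rate is work done divided by time taken; the GPU share
    is its rate over the combined rate, so both devices would finish at
    about the same time if the rates stay stable.
    """
    rate_gpu = work_gpu / time_gpu
    rate_cpu = work_cpu / time_cpu
    return rate_gpu / (rate_gpu + rate_cpu)


def split_columns(n_cols, gpu_fraction):
    """Split n_cols matrix columns into (gpu_cols, cpu_cols)."""
    gpu_cols = round(n_cols * gpu_fraction)
    return gpu_cols, n_cols - gpu_cols


if __name__ == "__main__":
    # Hypothetical device rates (columns per second), chosen for illustration.
    true_gpu_rate, true_cpu_rate = 150.0, 50.0
    fraction = 0.5  # start with an even split
    for step in range(3):
        gpu_cols, cpu_cols = split_columns(1000, fraction)
        t_gpu = gpu_cols / true_gpu_rate
        t_cpu = cpu_cols / true_cpu_rate
        fraction = update_gpu_fraction(gpu_cols, t_gpu, cpu_cols, t_cpu)
        print(step, gpu_cols, cpu_cols, round(fraction, 3))
```

With constant device rates the split settles after a single update; a real system would presumably need to smooth noisy timings and re-measure as the trailing matrix shrinks.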
Source: Journal of Computer Science & Technology (计算机科学技术学报, English edition), indexed in SCIE, EI, CSCD; 2011, Issue 5, pp. 854-865 (12 pages)
Funding: Supported by the National High Technology Research and Development 863 Program of China under Grant No. 2009AA01A128, the Major Science and Technology Project of China under Grant No. 2009ZX01036-001-003-001, and the National Natural Science Foundation of China under Grant Nos. 61003087, 60903044, 60903059, 60970033, and 60673150.
Keywords: petascale; Linpack; GPU; heterogeneous; supercomputer