Study on Instruction Scheduling of LU Decomposition on Many-core Architecture Simulator
Abstract: With advances in integrated-circuit technology, many-core processor architecture has become a research focus for computer architects. Many-core architectures raise overall processor performance through task-level parallelism; however, instruction-level parallelism remains an issue that many-core designers must consider carefully. This paper gives formal descriptions of floating-point efficiency and speedup and uses them to verify the necessity of instruction-level scheduling. It analyzes the in-core pipeline in detail and identifies the general problems of instruction-level scheduling. It proposes applying instruction-level scheduling and software pipelining on many-core architectures. For the LU decomposition algorithm in the Splash2 suite, it presents an instruction scheduling scheme on Scratchpad Memory (SPM) that uses the hardware support of the many-core architecture. The scheduled algorithm was simulated on the many-core simulator Godson-T: with 64 threads processing a 512×512 matrix, the program runs at 4 times its pre-scheduling performance.
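As a rough illustration of what instruction-level scheduling targets in an LU kernel, the sketch below unrolls the innermost row update by four so that each iteration contains four independent multiply-subtract operations that a pipeline could overlap. It is a minimal sketch under stated assumptions, not the paper's scheme: it uses a plain, unblocked Doolittle factorization without pivoting, the unroll factor of 4 is arbitrary, and the names `lu_in_place` and `reconstruct` are hypothetical. Python shows only the loop structure; any real benefit would come from a compiled kernel on the target core.

```python
def lu_in_place(a):
    """Doolittle LU without pivoting (assumes nonzero pivots).

    Overwrites the square matrix `a` (list of lists of floats) so that
    the strict lower triangle holds L (unit diagonal implied) and the
    upper triangle, including the diagonal, holds U.
    """
    n = len(a)
    for k in range(n):
        row_k = a[k]
        pivot = row_k[k]
        for i in range(k + 1, n):
            row_i = a[i]
            row_i[k] /= pivot          # multiplier, stored in L's position
            m = row_i[k]
            j = k + 1
            # Unrolled by 4: the four updates touch different elements,
            # so they carry no dependences on one another.
            while j + 3 < n:
                row_i[j] -= m * row_k[j]
                row_i[j + 1] -= m * row_k[j + 1]
                row_i[j + 2] -= m * row_k[j + 2]
                row_i[j + 3] -= m * row_k[j + 3]
                j += 4
            # Remainder loop for the columns left over after unrolling.
            while j < n:
                row_i[j] -= m * row_k[j]
                j += 1
    return a


def reconstruct(lu):
    """Multiply the packed L and U factors back together (for checking)."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else lu[i][k]   # unit diagonal of L
                total += l_ik * lu[k][j]
            out[i][j] = total
    return out
```

Calling `reconstruct(lu_in_place(copy_of_a))` and comparing the result with the original matrix is a simple way to confirm that the unrolled update leaves the factorization unchanged.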
Source: Journal of System Simulation (《系统仿真学报》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2011, No. 12, pp. 2603-2610 (8 pages)
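Funding: National "973" Key Basic Research Development Program (2005CB321600); Key Program of the National Natural Science Foundation of China (60736012); National Natural Science Foundation of China (61070025); National "863" High-Tech Research and Development Program (2009AA01Z103); National Science Fund for Distinguished Young Scholars (60925009); EU international cooperation project MULTICUBE (FP7-216693); Beijing Natural Science Foundation (4092044)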
Keywords: computer architecture; many-core; speedup; instruction-level parallelism; LU decomposition