Performance Optimization of Broadcast on the Godson-T Many-Core Architecture (cited by: 3)

An Optimization of Broadcast on Godson-T Many-Core System Architecture

Abstract: Godson-T is a large-scale on-chip many-core system suited to implementation in ultra-deep submicron technology. It is under development by the Advanced Microsystem Group in the Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences. The single-port design of Godson-T's on-chip memory saves chip area but limits the efficiency of reading shared data. Broadcast is a basic parallel algorithm used to accelerate data sharing, but implementing the traditional algorithm directly on Godson-T incurs heavy synchronization and mutual exclusion overhead and therefore does not deliver a good performance gain. Based on the Godson-T architecture, the authors optimize Broadcast and improve the efficiency of concurrent reads. Three techniques are used: eliminating bulk synchronization among threads, building a mapping table from source addresses to destination addresses, and implementing the core of Broadcast in assembly language with rearranged instructions. The first reduces the cost of synchronizing a large number of threads, the second provides a faster way to look up destination addresses, and the third takes full advantage of the Godson-T architecture. The optimized Broadcast performs well on Godson-T: with as few as 32 cores, the speedup already reaches 5.8.
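The abstract names a source-to-destination mapping table but does not describe its layout. As a rough illustration only, the Python sketch below precomputes such a table for a recursive-doubling broadcast rooted at core 0, so that in each round every core that already holds the data is read by exactly one destination, which is the kind of access pattern a single-port memory favors. The function names, the round structure, and the use of core indices in place of memory addresses are all assumptions made for this example; they are not taken from the paper.

```python
def build_mapping_table(num_cores):
    """Precompute (source, destination) core pairs for each broadcast round.

    Hypothetical layout: recursive doubling from core 0. In every round each
    core that already holds the data is the source for at most one new core.
    """
    table = []
    have = 1  # number of cores that already hold the data
    while have < num_cores:
        pairs = [(src, src + have)
                 for src in range(have)
                 if src + have < num_cores]
        table.append(pairs)
        have *= 2
    return table


def broadcast(local_mem, table):
    """Copy core 0's buffer to every other core by following the table.

    local_mem[i] stands in for core i's local memory (an assumption made
    for this simulation).
    """
    for pairs in table:
        for src, dst in pairs:
            local_mem[dst] = list(local_mem[src])
    return local_mem


if __name__ == "__main__":
    table = build_mapping_table(32)
    mem = [[1, 2, 3]] + [None] * 31
    broadcast(mem, table)
    print(len(table))  # number of rounds for 32 cores: 5
```

Because the table is built once, a copy step only has to look up its destination rather than compute it, which is one plausible reading of the "faster destination lookup" the abstract mentions.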

Authors: Bao Ergude (包尔固德), Li Weisheng (李伟生), Fan Dongrui (范东睿), Yang Yang (杨扬), Ma Xiaoyu (马啸宇)

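Affiliations: School of Software Engineering, Beijing Jiaotong University; Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences; School of Computer and Information Technology, Beijing Jiaotong University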

Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2010, No. 3, pp. 524-531 (8 pages)

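Funding: National Basic Research Program of China (973 Program) (2005CB321600); Key Program of the National Natural Science Foundation of China (60736012); Open Project of the Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences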

Keywords: Godson-T; many-core; Broadcast; synchronization; mutual exclusion; concurrent read; mapping table; speedup

Classification: TP302 [Automation and Computer Technology: Computer System Architecture]
