Abstract
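This paper introduces an experimental method for GPGPU performance optimization in multi-kernel environments based on GPGPU-sim. It aims to give beginners a reference experimental method for research on GPGPU performance optimization in multi-kernel scenarios, and it can also serve as a case study for teaching computer architecture. The paper focuses on the improvement ideas, experimental method, and experimental procedure of an adaptive thread block scheduling method for multi-kernel scenarios built on the GPGPU-sim simulator. It also introduces the GPGPU microarchitecture, the GPGPU-sim simulator, and the simulator's source code structure. The experimental results show that the described experimental method is feasible and that, relative to the baseline method, the proposed improvement strategy increases GPGPU execution efficiency in multi-kernel scenarios.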
[Objective] With the rapid development and continuous improvement of the parallel computing architecture of general-purpose graphics processing units (GPGPUs), their computing power has improved significantly, making them essential in high-performance and high-throughput applications. However, as tasks increase in number and complexity, multi-kernel execution environments face serious challenges, so optimizing GPGPU performance in multi-kernel environments is crucial. Scholars often use GPGPU-sim as the main tool for studying GPGPU performance optimization methods. Even so, there is currently no comprehensive guide to conducting GPGPU performance optimization experiments with GPGPU-sim in multi-kernel environments, which makes experimental verification and analysis in this area difficult for beginners. Furthermore, while the round-robin (RR) scheduling strategy ensures fair resource utilization, it may lead to scheduling delays between multiple kernels in concurrent execution environments. This study aims to provide key experimental methods for beginners to optimize GPGPU performance in multi-kernel concurrent execution environments and to offer valuable case references for teaching computer architecture.

[Methods] First, the article gives a detailed introduction to the GPGPU architecture and explores the source code structure of the GPGPU-sim simulator, providing readers with the relevant background knowledge. It then analyzes and discusses the ideas behind the proposed adaptive thread block (ATB) scheduling strategy and its algorithm. The article describes how the GPGPU-sim source code is modified so that thread blocks from multiple kernels are scheduled under the ATB strategy. In addition, so that beginners can easily replicate the experiments, the article explains in detail the configuration parameters of GPGPU-sim and the modifications made to the test programs.

[Results] The article compares the ATB strategy with the baseline RR thread block scheduling method, analyzing the experimental results in terms of system performance, shared memory utilization, register utilization, and memory access efficiency. In terms of system performance, the ATB strategy enables concurrent execution of multiple kernels, which improves resource utilization on the GPGPU and thereby significantly improves overall execution performance. Compared to RR, ATB improves execution efficiency by up to 76%, with an average system performance improvement of 45%. In terms of shared memory and register utilization, the ATB strategy allows threads from multiple kernels to access GPGPU resources concurrently, improving the utilization of these resources. Compared to RR, shared memory usage under ATB increased by up to 84%, with an average increase of 54%. Register usage increased by 29% on average and by up to 49%. Regarding memory access efficiency, ATB allows threads from different kernels to access different storage resources, reducing the probability that threads compete for the same resource. Compared to the RR strategy, pipeline stall cycles under ATB decreased by an average of 5%, while the cycles that warps spend waiting for data decreased by up to 44% and by 29% on average. Overall, compared to the baseline method, the ATB strategy proposed in this paper improves both the efficiency of concurrent multi-kernel execution and GPGPU performance.

[Conclusions] This article provides an in-depth analysis and discussion of GPGPU performance optimization methods using GPGPU-sim in a multi-kernel environment, and it designs and implements an ATB scheduling strategy. By adopting the improved ATB scheduling strategy in the GPGPU-sim simulator, the study achieved concurrent execution of multiple kernels and verified through experimental data that the strategy improves GPGPU performance. This work provides detailed and feasible experimental methods for beginners and offers reference cases for teaching computer architecture.
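The abstract's claim that running thread blocks from several kernels concurrently raises shared memory and register utilization can be illustrated with a toy occupancy model. The sketch below is not the paper's ATB algorithm and does not use any GPGPU-sim code. The `SM` limits, the two kernels, and the `fill` and `utilization` helpers are all hypothetical, chosen only so that one kernel is bound by shared memory and the other by registers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kernel:
    name: str
    regs_per_block: int   # registers needed by one thread block
    smem_per_block: int   # shared memory (bytes) needed by one thread block


@dataclass(frozen=True)
class SM:
    # Hypothetical per-SM limits, for illustration only.
    regs: int = 65536
    smem: int = 49152
    max_blocks: int = 8


def fill(sm, kernels):
    """Place thread blocks on one SM, cycling over the given kernels in
    round-robin order, until no further block from any kernel fits."""
    placed = {k.name: 0 for k in kernels}
    regs_left, smem_left, slots_left = sm.regs, sm.smem, sm.max_blocks
    progress = True
    while progress and slots_left > 0:
        progress = False
        for k in kernels:
            fits = (slots_left > 0
                    and k.regs_per_block <= regs_left
                    and k.smem_per_block <= smem_left)
            if fits:
                placed[k.name] += 1
                regs_left -= k.regs_per_block
                smem_left -= k.smem_per_block
                slots_left -= 1
                progress = True
    return placed


def utilization(sm, kernels, placed):
    """Return (register utilization, shared memory utilization) as fractions."""
    regs_used = sum(k.regs_per_block * placed[k.name] for k in kernels)
    smem_used = sum(k.smem_per_block * placed[k.name] for k in kernels)
    return regs_used / sm.regs, smem_used / sm.smem


# Hypothetical kernels: A is shared-memory bound, B is register bound.
KERNEL_A = Kernel("A", regs_per_block=4096, smem_per_block=16384)
KERNEL_B = Kernel("B", regs_per_block=16384, smem_per_block=2048)

if __name__ == "__main__":
    sm = SM()
    for group in ([KERNEL_A], [KERNEL_B], [KERNEL_A, KERNEL_B]):
        placed = fill(sm, group)
        regs_u, smem_u = utilization(sm, group, placed)
        print(placed, f"regs={regs_u:.1%}", f"smem={smem_u:.1%}")
```

In this toy setting, an SM that hosts only kernel A fits 3 blocks and leaves most registers idle, and an SM that hosts only kernel B fits 4 blocks and leaves most shared memory idle. When blocks from both kernels share the SM, 2 blocks of A and 3 of B fit, and both resources are largely in use at once. The figures reported in the abstract come from the authors' GPGPU-sim experiments, not from a model like this one.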
Authors
张军
魏继桢
沈凡凡
谭海
何炎祥
ZHANG Jun; WEI Jizhen; SHEN Fanfan; TAN Hai; HE Yanxiang (School of Information Engineering, East China University of Technology, Nanchang 330013, China; School of Information Engineering, Nanjing Audit University, Nanjing 211815, China; Computer School, Wuhan University, Wuhan 430072, China)
Source
《实验技术与管理》
CAS
Peking University Core Journals (北大核心)
2024, No. 7, pp. 87-93 (7 pages)
Experimental Technology and Management
Funding
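National Natural Science Foundation of China (62162002, 61662002, 61902189)
Natural Science Foundation of Jiangxi Province (20212BAB202002)
Basic Science (Natural Science) Research Project of Jiangsu Higher Education Institutions (22KJA520004).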