Optimization of the DGEMM Function for the Loongson 3B1500 Architecture (面向龙芯3B1500体系结构的DGEMM函数优化)

Cited by: 3
Abstract DGEMM, the double-precision general matrix multiplication routine, is the most important Level 3 function in BLAS, the basic software library for high-performance computing. This paper optimizes DGEMM for the architecture of the Loongson 3B1500 processor. Reserved physical memory and huge pages are used to reduce page swapping and TLB misses. Loongson's 128-bit vector load/store instructions and vector multiply-add instructions are used to vectorize the matrix multiplication. A blocking strategy is designed around the memory access pattern of each matrix, and the cache locking mechanism of the 3B1500 is used to keep the most frequently reused blocks locked in cache, which reduces cache misses. Finally, because the time needed to prefetch matrices A and B exceeds the computation time, a new matrix prefetching algorithm is proposed. It enlarges the amount of work in the core computation so that the prefetch time for A and B is completely hidden behind computation, and it prefetches matrix C by issuing ld instructions that target register $0. The optimized DGEMM function reaches more than 80% of theoretical peak performance in both single-threaded and multi-threaded runs.
Source Journal of Chinese Computer Systems (《小型微型计算机系统》), indexed in CSCD and the Peking University Core Journals list, 2014, No. 7, pp. 1523-1527 (5 pages)
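Funding Supported by the National High Technology Research and Development Program of China ("863" Program) (2012AA01A30904) and the Guangdong Province Academician Workstation Construction Project (2012B090500020)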
Keywords Loongson 3B1500 processor; BLAS; DGEMM; matrix multiplication; data prefetching
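
The blocking idea in the abstract can be illustrated with a small sketch. This is not the paper's kernel: the abstract gives neither the block sizes nor which matrix is locked in cache, so the function name `dgemm_blocked`, the block parameters `mc`, `kc`, `nc`, and the choice to reuse a copied block of A are all assumptions made for illustration. The code only shows the loop structure of a blocked C = alpha*A*B + beta*C update, with the copied A block standing in for data that would stay resident in cache.

```python
def dgemm_blocked(A, B, C, alpha, beta, mc=4, kc=4, nc=4):
    """Update C in place to alpha*A*B + beta*C using a blocked loop nest.

    A is m x k, B is k x n, C is m x n, all as row-major lists of lists.
    mc, kc, nc are hypothetical block sizes, not values from the paper.
    """
    m, k, n = len(A), len(B), len(B[0])

    # Apply beta once up front so the block loops only accumulate.
    for i in range(m):
        for j in range(n):
            C[i][j] *= beta

    for p0 in range(0, k, kc):              # blocks along the shared dimension
        p1 = min(p0 + kc, k)
        for i0 in range(0, m, mc):          # row blocks of A and C
            i1 = min(i0 + mc, m)
            # Copy one block of A; it is reused for every column block of B
            # (an illustrative stand-in for a block kept locked in cache).
            a_block = [row[p0:p1] for row in A[i0:i1]]
            for j0 in range(0, n, nc):      # column blocks of B and C
                j1 = min(j0 + nc, n)
                for i, a_row in enumerate(a_block, start=i0):
                    c_row = C[i]
                    for dp, a in enumerate(a_row):
                        b_row = B[p0 + dp]
                        scaled = alpha * a
                        for j in range(j0, j1):
                            c_row[j] += scaled * b_row[j]
    return C
```

A real implementation on the 3B1500 would replace the innermost loops with vector multiply-add instructions and overlap prefetching with computation, as the abstract describes; pure Python is used here only to make the blocking order easy to follow.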