Abstract
DGEMM, the double-precision general matrix multiplication routine, is the most important Level 3 function in BLAS, the foundational library for high-performance computing. This paper optimizes DGEMM for the architecture of the Loongson 3B1500 processor. Reserved physical memory and huge pages are used to reduce page swapping and TLB misses. Loongson's 128-bit vector load/store and vector multiply-add instructions are used to vectorize the matrix multiplication. A blocking strategy is designed around the memory access pattern of each matrix, and the 3B1500 cache locking mechanism keeps the most heavily reused blocks resident in the cache to reduce cache misses. Finally, because prefetching matrices A and B originally took longer than the computation itself, a new prefetching algorithm is proposed: it increases the amount of work in the core computation so that the prefetch time for A and B is completely hidden behind computation, and it prefetches matrix C by issuing ld instructions that target register $0. The optimized DGEMM achieves more than 80% of theoretical peak performance in both single-threaded and multi-threaded runs.
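The blocking and prefetching ideas summarized in the abstract can be illustrated with a minimal portable C sketch. This is not the authors' Loongson kernel. The block sizes MC, KC and NC, the function name dgemm_blocked, and the row-major layout are all assumptions made for illustration. The GCC/Clang builtin __builtin_prefetch stands in for the paper's technique of loading into register $0, and the sketch includes no vector instructions, cache locking, or huge pages.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block sizes; a tuned kernel would derive them from cache sizes. */
enum { MC = 64, KC = 64, NC = 64 };

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Computes C += A * B for row-major A (m x k), B (k x n), C (m x n).
 * The three outer loops walk over blocks so that a KC x NC panel of B
 * and an MC x KC panel of A are reused while they are still in cache. */
void dgemm_blocked(size_t m, size_t n, size_t k,
                   const double *A, const double *B, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);

    for (size_t jc = 0; jc < n; jc += NC) {
        size_t nb = min_sz(NC, n - jc);
        for (size_t pc = 0; pc < k; pc += KC) {
            size_t kb = min_sz(KC, k - pc);
            for (size_t ic = 0; ic < m; ic += MC) {
                size_t mb = min_sz(MC, m - ic);
                for (size_t i = 0; i < mb; ++i) {
                    double *c_row = &C[(ic + i) * n + jc];
                    /* Assumption: a compiler prefetch hint replaces the
                     * paper's "ld into $0" trick for fetching C early. */
                    __builtin_prefetch(c_row, 1, 3);
                    for (size_t p = 0; p < kb; ++p) {
                        double a = A[(ic + i) * k + (pc + p)];
                        const double *b_row = &B[(pc + p) * n + jc];
                        for (size_t j = 0; j < nb; ++j)
                            c_row[j] += a * b_row[j];
                    }
                }
            }
        }
    }
}
```

Calling dgemm_blocked(m, n, k, A, B, C) with C zero-initialized yields the plain product A * B. On real hardware the innermost loop is where vector multiply-add instructions would be applied, and the block sizes would need tuning against the actual cache hierarchy.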
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
CSCD
Peking University Core Journal (北大核心)
2014, No. 7, pp. 1523-1527 (5 pages)
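Funding
Supported by the National High Technology Research and Development Program of China (863 Program) (2012AA01A30904)
Supported by the Guangdong Province Academician Workstation Construction Project (2012B090500020)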