Sparse Matrix-Matrix Multiplication (SpMM) and Sampled Dense-Dense Matrix Multiplication (SDDMM) are important sparse operators in scientific computing and deep learning. Tensor Core Units (TCUs) equip modern accelerators with superior computing power, making them a promising vehicle for pushing matrix operators to a higher performance level. However, due to the irregularity of sparse data, it is difficult to deliver practical speedups on TCUs. To this end, we propose FlashSparse, a novel approach that bridges the gap between sparse workloads and the TCU architecture. Specifically, FlashSparse minimizes the sparse granularity of SpMM and SDDMM on TCUs through a novel swap-and-transpose matrix-multiplication strategy. Benefiting from this minimal sparse granularity, computation redundancy is remarkably reduced while the computing power of TCUs is fully utilized. In addition, FlashSparse is equipped with a memory-efficient thread-mapping strategy for coalesced data access and a sparse matrix storage format that reduces memory footprint. Extensive experiments on H100 and RTX 4090 GPUs show that FlashSparse sets a new state of the art for sparse matrix multiplications (geometric-mean speedups of 5.5x over DTC-SpMM and 3.22x over RoDe).
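The core idea in the abstract can be illustrated with a toy NumPy sketch (this is not FlashSparse's CUDA implementation; tile extents, the `occupied_slots` helper, and the sparsity level are illustrative assumptions). For SpMM C = A x B with sparse A, a Tensor Core MMA tile such as m16n8k16 forces the sparse operand to be fetched as 16x1 vectors when A is the left operand; computing C^T = B^T x A^T instead makes the sparse matrix the right operand, shrinking the vector granularity to 8x1, so fewer zeros are padded into each nonzero vector:

```python
import numpy as np

# Tile extents along M and N, mirroring an m16n8k16-style MMA (assumption).
M_TILE, N_TILE = 16, 8

rng = np.random.default_rng(0)
# A toy sparse matrix (~5% nonzeros) and a dense matrix.
A = np.where(rng.random((64, 64)) < 0.05, rng.random((64, 64)), 0.0)
B = rng.random((64, 32))

def occupied_slots(S, vec_len):
    """Count slots consumed when every vec_len x 1 column segment of S
    containing at least one nonzero is padded to full length."""
    segs = S.reshape(S.shape[0] // vec_len, vec_len, S.shape[1])
    nonempty = (np.abs(segs).sum(axis=1) > 0).sum()
    return int(nonempty) * vec_len

# Direct mapping: sparse vectors padded to 16x1 (A is the left operand).
direct = occupied_slots(A, M_TILE)
# Swap-and-transpose: sparse vectors padded to only 8x1 (A^T is the
# right operand), so splitting each 16x1 vector in two can only reduce
# the number of padded zero slots.
swapped = occupied_slots(A, N_TILE)

# The result is unchanged, by the identity (B^T @ A^T)^T == A @ B.
C_direct = A @ B
C_swapped = (B.T @ A.T).T
assert np.allclose(C_direct, C_swapped)
assert swapped <= direct
print(direct, swapped)
```

The halved granularity is the source of the reduced computation redundancy claimed above: every nonzero vector wastes at most 7 padded zeros instead of 15, while the MMA tile stays fully utilized.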

Tue 4 Mar

Displayed time zone: Pacific Time (US & Canada)

14:00 - 15:20
Session 8: Tensor Cores (Session Chair: Jeffrey Vetter)
Main Conference at Acacia D
14:00
20m
Talk
FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores
Main Conference
Jinliang Shi, Shigang Li, Youxuan Xu, Rongtian Fu, Xueying Wang, and Tong Wu (Beijing University of Posts and Telecommunications)
14:20
20m
Talk
Acc-SpMM: Accelerating General-purpose Sparse Matrix-Matrix Multiplication with GPU Tensor Cores
Main Conference
Haisha Zhao (Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences), San Li (Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Jiaheng Wang (Renmin University of China), Chunbao Zhou (Computer Network Information Center, Chinese Academy of Sciences), Jue Wang (Computer Network Information Center, Chinese Academy of Sciences), Zhikuang Xin (Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Shunde Li (Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Zhiqiang Liang (Computer Network Information Center, Chinese Academy of Sciences), Zhijie Pan (Hangzhou Dianzi University), Fang Liu (Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Yan Zeng (Hangzhou Dianzi University), Yangang Wang (Computer Network Information Center, Chinese Academy of Sciences), Xuebin Chi (Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences)
14:40
20m
Talk
BerryBees: Breadth First Search by Bit-Tensor-Cores (Distinguished Paper Award; Best Artifact Award)
Main Conference
Yuyao Niu (Barcelona Supercomputing Center (BSC) and Universitat Politècnica de Catalunya (UPC)), Marc Casas (Barcelona Supercomputing Center)
15:00
20m
Talk
FlashFFTStencil: Bridging Fast Fourier Transforms to Memory-Efficient Stencil Computations on Tensor Core Units
Main Conference
Haozhi Han (Microsoft Research; Peking University), Kun Li (Microsoft Research), Wei Cui (Microsoft Research), Donglin Bai (Microsoft Research), Yiwei Zhang (UCAS; Microsoft Research), Liang Yuan (Chinese Academy of Sciences), Yifeng Cheng (Peking University), Yunquan Zhang, Ting Cao (Microsoft Research), Mao Yang (Microsoft Research)