Sparse Matrix-Matrix Multiplication (SpMM) and Sampled Dense-Dense Matrix Multiplication (SDDMM) are important sparse operators in scientific computing and deep learning. Tensor Core Units (TCUs) equip modern accelerators with superior computing power, making them a promising vehicle for pushing matrix operators to a higher performance level. However, due to the irregularity of sparse data, it is difficult to deliver practical speedups on TCUs. To this end, we propose FlashSparse, a novel approach that bridges the gap between sparse workloads and the TCU architecture. Specifically, FlashSparse minimizes the sparse granularity of SpMM and SDDMM on TCUs through a novel swap-and-transpose matrix-multiplication strategy. Benefiting from this minimal sparse granularity, computation redundancy is remarkably reduced while the computing power of TCUs is fully utilized. In addition, FlashSparse is equipped with a memory-efficient thread-mapping strategy for coalesced data access and a sparse matrix storage format that reduces memory footprint. Extensive experiments on H100 and RTX 4090 GPUs show that FlashSparse sets a new state of the art for sparse matrix multiplications (geometric-mean speedups of 5.5x over DTC-SpMM and 3.22x over RoDe).
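The core idea in the abstract can be illustrated with a toy NumPy sketch (this is not FlashSparse's CUDA implementation; tile extents, the `occupied_slots` helper, and the sparsity level are illustrative assumptions). For SpMM C = A x B with sparse A, a Tensor Core MMA tile such as m16n8k16 forces the sparse operand to be fetched as 16x1 vectors when A is the left operand; computing C^T = B^T x A^T instead makes the sparse matrix the right operand, shrinking the vector granularity to 8x1, so fewer zeros are padded into each nonzero vector:

```python
import numpy as np

# Tile extents along M and N, mirroring an m16n8k16-style MMA (assumption).
M_TILE, N_TILE = 16, 8

rng = np.random.default_rng(0)
# A toy sparse matrix (~5% nonzeros) and a dense matrix.
A = np.where(rng.random((64, 64)) < 0.05, rng.random((64, 64)), 0.0)
B = rng.random((64, 32))

def occupied_slots(S, vec_len):
    """Count slots consumed when every vec_len x 1 column segment of S
    containing at least one nonzero is padded to full length."""
    segs = S.reshape(S.shape[0] // vec_len, vec_len, S.shape[1])
    nonempty = (np.abs(segs).sum(axis=1) > 0).sum()
    return int(nonempty) * vec_len

# Direct mapping: sparse vectors padded to 16x1 (A is the left operand).
direct = occupied_slots(A, M_TILE)
# Swap-and-transpose: sparse vectors padded to only 8x1 (A^T is the
# right operand), so splitting each 16x1 vector in two can only reduce
# the number of padded zero slots.
swapped = occupied_slots(A, N_TILE)

# The result is unchanged, by the identity (B^T @ A^T)^T == A @ B.
C_direct = A @ B
C_swapped = (B.T @ A.T).T
assert np.allclose(C_direct, C_swapped)
assert swapped <= direct
print(direct, swapped)
```

The halved granularity is the source of the reduced computation redundancy claimed above: every nonzero vector wastes at most 7 padded zeros instead of 15, while the MMA tile stays fully utilized.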

Tue 4 Mar

Displayed time zone: Pacific Time (US & Canada)

14:00 - 15:20
Session 8: Tensor Cores (Session Chair: Jeffrey Vetter)
Main Conference at Acacia D
14:00
20m
Talk
FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores
Main Conference
Jinliang Shi, Shigang Li, Youxuan Xu, Rongtian Fu, Xueying Wang, and Tong Wu (Beijing University of Posts and Telecommunications)
14:20
20m
Talk
Acc-SpMM: Accelerating General-purpose Sparse Matrix-Matrix Multiplication with GPU Tensor Cores
Main Conference
Haisha Zhao (Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences), San Li (Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Jiaheng Wang (Renmin University of China), Chunbao Zhou (Computer Network Information Center, Chinese Academy of Sciences), Jue Wang (Computer Network Information Center, Chinese Academy of Sciences), Zhikuang Xin (Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Shunde Li (Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Zhiqiang Liang (Computer Network Information Center, Chinese Academy of Sciences), Zhijie Pan (Hangzhou Dianzi University), Fang Liu (Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Yan Zeng (Hangzhou Dianzi University), Yangang Wang (Computer Network Information Center, Chinese Academy of Sciences), Xuebin Chi (Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences)
14:40
20m
Talk
BerryBees: Breadth First Search by Bit-Tensor-Cores (Distinguished Paper Award; Best Artifact Award)
Main Conference
Yuyao Niu (Barcelona Supercomputing Center (BSC) and Universitat Politècnica de Catalunya (UPC)), Marc Casas (Barcelona Supercomputing Center)
15:00
20m
Talk
FlashFFTStencil: Bridging Fast Fourier Transforms to Memory-Efficient Stencil Computations on Tensor Core Units
Main Conference
Haozhi Han (Microsoft Research; Peking University), Kun Li (Microsoft Research), Wei Cui (Microsoft Research), Donglin Bai (Microsoft Research), Yiwei Zhang (UCAS; Microsoft Research), Liang Yuan (Chinese Academy of Sciences), Yifeng Cheng (Peking University), Yunquan Zhang, Ting Cao (Microsoft Research), Mao Yang (Microsoft Research)