Acc-SpMM: Accelerating General-purpose Sparse Matrix-Matrix Multiplication with GPU Tensor Cores (PPoPP 2025 - Main Conference)

Who

Haisha Zhao, Li San, Jiaheng Wang, Chunbao Zhou, Jue Wang, Zhikuang Xin, lishunde, ZhiQiang Liang, Zhijie Pan, Fang Liu, Yan Zeng, Yangang Wang, Xuebin Chi

Track

PPoPP 2025 Main Conference

When

Tue 4 Mar 2025 14:20 - 14:40 at Acacia D - Session 8: Tensor Cores (Session Chair: Jeffrey Vetter)

Abstract

General-purpose Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental kernel in scientific computing and deep learning. The emergence of new matrix computation units such as Tensor Cores (TCs) brings more opportunities for SpMM acceleration. However, fully exploiting the hardware requires systematic optimization. In this paper, we propose Acc-SpMM, a high-performance SpMM library on Tensor Cores, with multiple optimizations, including data-affinity-based reordering, a memory-efficient compressed format, a high-throughput pipeline, and adaptive sparsity-aware load balancing. Compared with state-of-the-art SpMM kernels on various NVIDIA GPU architectures with a diverse range of benchmark matrices, Acc-SpMM achieves significant performance improvements: on average 3.24x (up to 5.11x) speedup on RTX 4090, on average 1.36x (up to 5.49x) speedup on A800, and on average 1.16x (up to 3.60x) speedup on H100 over cuSPARSE.
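
To make the operation concrete, here is a minimal reference sketch of SpMM, C = A x B, with a sparse A stored in the standard CSR format and a dense B. It is not the Acc-SpMM method: it uses no Tensor Cores, reordering, compressed tiling, or load balancing, and the function name `spmm_csr` and its parameters are assumptions chosen for illustration only.

```python
def spmm_csr(row_ptr, col_idx, values, dense, n_cols):
    """Reference SpMM: C = A @ B, with A in CSR form and B a dense list of rows.

    row_ptr, col_idx, values: the usual CSR arrays describing A.
    dense: B as a list of rows, each of length n_cols.
    (Hypothetical helper for illustration; not part of any library.)
    """
    n_rows = len(row_ptr) - 1
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        # Visit only the stored nonzeros of row i of A.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            a = values[k]
            b_row = dense[col_idx[k]]
            # Each nonzero scales one row of B and accumulates into row i of C.
            for j in range(n_cols):
                out[i][j] += a * b_row[j]
    return out


# A = [[1, 0, 2],
#      [0, 0, 3]]
row_ptr = [0, 2, 3]
col_idx = [0, 2, 2]
values = [1.0, 2.0, 3.0]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
C = spmm_csr(row_ptr, col_idx, values, B, 2)
```

The inner access `dense[col_idx[k]]` is irregular because it depends on where the nonzeros of A fall, which is one reason SpMM is harder to map onto dense matrix units than ordinary matrix multiplication.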

Haisha Zhao

Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences

Li San

Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences

Jiaheng Wang

Renmin University of China

Chunbao Zhou

Computer Network Information Center, Chinese Academy of Sciences

Jue Wang

Computer Network Information Center, Chinese Academy of Sciences

Zhikuang Xin

Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences

lishunde

Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences

ZhiQiang Liang

Computer Network Information Center, Chinese Academy of Sciences

Zhijie Pan

Hangzhou Dianzi University

Fang Liu

Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences

Yan Zeng

Hangzhou Dianzi University

Yangang Wang

Computer Network Information Center, Chinese Academy of Sciences

Xuebin Chi

Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences

Session Program

Tue 4 Mar
Displayed time zone: Pacific Time (US & Canada)

14:00 - 15:20	Session 8: Tensor Cores (Session Chair: Jeffrey Vetter), Main Conference, at Acacia D

14:00 20m Talk		FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores
Track: Main Conference
Authors: Jinliang Shi (Beijing University of Posts and Telecommunications), Shigang Li (Beijing University of Posts and Telecommunications), Youxuan Xu (Beijing University of Posts and Telecommunications), Rongtian Fu (Beijing University of Posts and Telecommunications), Xueying Wang (Beijing University of Posts and Telecommunications), Tong Wu (Beijing University of Posts and Telecommunications)

14:20 20m Talk		Acc-SpMM: Accelerating General-purpose Sparse Matrix-Matrix Multiplication with GPU Tensor Cores
Track: Main Conference
Authors: Haisha Zhao (Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Li San (Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Jiaheng Wang (Renmin University of China), Chunbao Zhou (Computer Network Information Center, Chinese Academy of Sciences), Jue Wang (Computer Network Information Center, Chinese Academy of Sciences), Zhikuang Xin (Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences), lishunde (Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences), ZhiQiang Liang (Computer Network Information Center, Chinese Academy of Sciences), Zhijie Pan (Hangzhou Dianzi University), Fang Liu (Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Yan Zeng (Hangzhou Dianzi University), Yangang Wang (Computer Network Information Center, Chinese Academy of Sciences), Xuebin Chi (Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences)

14:40 20m Talk		BerryBees: Breadth First Search by Bit-Tensor-Cores
Awards: Distinguished Paper Award, Best Artifact Award
Track: Main Conference
Authors: Yuyao Niu (Barcelona Supercomputing Center (BSC) - Universitat Politècnica de Catalunya (UPC)), Marc Casas (Barcelona Supercomputing Center)

15:00 20m Talk		FlashFFTStencil: Bridging Fast Fourier Transforms to Memory-Efficient Stencil Computations on Tensor Core Units
Track: Main Conference
Authors: Haozhi Han (Microsoft Research; Peking University), Kun Li (Microsoft Research), Wei Cui (Microsoft Research), Donglin Bai (Microsoft Research), Yiwei Zhang (UCAS; Microsoft Research), Liang Yuan (Chinese Academy of Sciences), Yifeng Cheng (Peking University), Yunquan Zhang Zhang, Ting Cao (Microsoft Research), Mao Yang (Microsoft Research)