Harnessing Inter-GPU Shared Memory for Seamless MoE Communication-Computation Fusion
The Mixture of Experts (MoE) architecture improves model quality by scaling up the number of parameters. However, its adoption in distributed training is hindered by significant communication overhead and expert load imbalance. Existing methods allow only coarse-grained overlapping of communication and computation; they modestly reduce communication costs but noticeably impair computational efficiency. Furthermore, current approaches to load imbalance often compromise model quality.
We introduce CCFuser, a novel framework for efficient training of MoE models. CCFuser replaces the costly All2All operations typical of MoE architectures with efficient inter-GPU shared memory access, allowing local and remote data to be computed concurrently within a fused kernel and achieving substantially higher FLOPS for GEMM operations. In addition, CCFuser addresses load imbalance with a resource-efficient expert reassignment algorithm that optimizes the use of computational resources through equivalent graph transformations without sacrificing statistical accuracy. By combining these optimizations, CCFuser significantly improves GPU utilization. Experiments on A100 servers show that CCFuser outperforms state-of-the-art methods such as FastMoE and FasterMoE by an average of 2.96x and 2.48x, respectively, with a maximum speedup of 4.34x.
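The underlying mechanism, replacing an explicit All2All exchange with direct reads of a peer GPU's memory from inside the compute kernel, can be illustrated with a minimal CUDA sketch. This is not CCFuser's actual kernel: it only shows, assuming peer-to-peer access is available between two GPUs, how a single fused kernel can consume a local buffer and a remote buffer in one pass. The buffer names, the element-wise accumulation standing in for the expert GEMM, and the launch configuration are illustrative placeholders.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Fused kernel: one pass reads a local buffer and a buffer that physically
// resides on a peer GPU, so data movement happens as ordinary loads instead
// of a separate host-driven All2All step. (Illustrative sketch only.)
__global__ void fused_accumulate(const float* local_tokens,
                                 const float* remote_tokens,  // allocated on the peer GPU
                                 float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // The remote load is serviced over NVLink/PCIe by the hardware.
        out[i] = local_tokens[i] + remote_tokens[i];
    }
}

int main() {
    const int n = 1 << 20;
    float *local_buf = nullptr, *remote_buf = nullptr, *out = nullptr;

    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, /*device=*/0, /*peerDevice=*/1);
    if (!can_access) {
        printf("P2P not available between GPU 0 and GPU 1\n");
        return 1;
    }

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(/*peerDevice=*/1, 0);  // GPU 0 may now read GPU 1's memory
    cudaMalloc(&local_buf, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    cudaSetDevice(1);
    cudaMalloc(&remote_buf, n * sizeof(float));       // the "remote" expert input

    cudaSetDevice(0);
    fused_accumulate<<<(n + 255) / 256, 256>>>(local_buf, remote_buf, out, n);
    cudaDeviceSynchronize();
    printf("fused kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```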
Mon 3 Mar (displayed time zone: Pacific Time, US & Canada)
Session: 15:40 - 16:40

15:40 (20m) Talk: AC-Cache: A Memory-Efficient Caching System for Small Objects via Exploiting Access Correlations. Main Conference.

16:00 (20m) Talk: Effectively Virtual Page Prefetching via Spatial-Temporal Patterns for Memory-intensive Cloud Applications. Main Conference. Yun Wang (Shanghai Jiao Tong University), Liang Chen, Tianmai Deng (Shanghai Jiao Tong University), Ben Luo (Alibaba Group), Yibin Shen (Alibaba Cloud), Zhixiang Wei (Shanghai Jiao Tong University), Yixiao Xu (Shanghai Jiao Tong University), Minglang Huang (Shanghai Jiao Tong University), Zhengwei Qi (Shanghai Jiao Tong University).

16:20 (20m) Talk: Harnessing Inter-GPU Shared Memory for Seamless MoE Communication-Computation Fusion. Main Conference. Hulin Wang, Yaqi Xia (Wuhan University), Donglin Yang (Nvidia Corporation), Xiaobo Zhou (University of Macau), Dazhao Cheng (Wuhan University).