Harnessing Inter-GPU Shared Memory for Seamless MoE Communication-Computation Fusion
The Mixture of Experts (MoE) architecture improves model quality by scaling up the number of parameters. However, its adoption in distributed training is hindered by significant communication overhead and expert load imbalance. Existing methods allow only coarse-grained overlapping of communication and computation; they modestly reduce communication costs but noticeably impair computational efficiency. Furthermore, current approaches to load imbalance often compromise model quality.
We introduce CCFuser, a novel framework for efficient training of MoE models. CCFuser replaces the costly All2All operations typical of MoE architectures with efficient inter-GPU shared memory access, allowing local and remote data to be computed concurrently within a fused kernel and achieving substantially higher FLOPS for GEMM operations. In addition, CCFuser addresses load imbalance with a resource-efficient expert reassignment algorithm that optimizes the use of computational resources through equivalent graph transformations without sacrificing statistical accuracy. By combining these optimizations, CCFuser significantly improves GPU utilization. Experiments on A100 servers show that CCFuser outperforms state-of-the-art methods such as FastMoE and FasterMoE by an average of 2.96x and 2.48x, respectively, with a maximum speedup of 4.34x.
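The underlying mechanism, replacing an explicit All2All exchange with direct reads of a peer GPU's memory from inside the compute kernel, can be illustrated with a minimal CUDA sketch. This is not CCFuser's actual kernel: it only shows, assuming peer-to-peer access is available between two GPUs, how a single fused kernel can consume a local buffer and a remote buffer in one pass. The buffer names, the element-wise accumulation standing in for the expert GEMM, and the launch configuration are illustrative placeholders.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Fused kernel: one pass reads a local buffer and a buffer that physically
// resides on a peer GPU, so data movement happens as ordinary loads instead
// of a separate host-driven All2All step. (Illustrative sketch only.)
__global__ void fused_accumulate(const float* local_tokens,
                                 const float* remote_tokens,  // allocated on the peer GPU
                                 float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // The remote load is serviced over NVLink/PCIe by the hardware.
        out[i] = local_tokens[i] + remote_tokens[i];
    }
}

int main() {
    const int n = 1 << 20;
    float *local_buf = nullptr, *remote_buf = nullptr, *out = nullptr;

    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, /*device=*/0, /*peerDevice=*/1);
    if (!can_access) {
        printf("P2P not available between GPU 0 and GPU 1\n");
        return 1;
    }

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(/*peerDevice=*/1, 0);  // GPU 0 may now read GPU 1's memory
    cudaMalloc(&local_buf, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    cudaSetDevice(1);
    cudaMalloc(&remote_buf, n * sizeof(float));       // the "remote" expert input

    cudaSetDevice(0);
    fused_accumulate<<<(n + 255) / 256, 256>>>(local_buf, remote_buf, out, n);
    cudaDeviceSynchronize();
    printf("fused kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```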
Mon 3 Mar (displayed time zone: Pacific Time, US & Canada)
Session: 15:40 - 16:40

15:40 (20m) Talk: AC-Cache: A Memory-Efficient Caching System for Small Objects via Exploiting Access Correlations. Main Conference.

16:00 (20m) Talk: Effectively Virtual Page Prefetching via Spatial-Temporal Patterns for Memory-intensive Cloud Applications. Main Conference. Yun Wang (Shanghai Jiao Tong University), Liang Chen, Tianmai Deng (Shanghai Jiao Tong University), Ben Luo (Alibaba Group), Yibin Shen (Alibaba Cloud), Zhixiang Wei (Shanghai Jiao Tong University), Yixiao Xu (Shanghai Jiao Tong University), Minglang Huang (Shanghai Jiao Tong University), Zhengwei Qi (Shanghai Jiao Tong University).

16:20 (20m) Talk: Harnessing Inter-GPU Shared Memory for Seamless MoE Communication-Computation Fusion. Main Conference. Hulin Wang, Yaqi Xia (Wuhan University), Donglin Yang (Nvidia Corporation), Xiaobo Zhou (University of Macau), Dazhao Cheng (Wuhan University).