Mario: Near Zero-cost Activation Checkpointing in Pipeline Parallelism
Large language models must be trained in parallel because of their enormous parameter counts and memory footprints. Among various parallelism techniques, pipeline parallelism is widely adopted in inter-node scenarios thanks to its minimal communication overhead. However, state-of-the-art pipeline schemes incur extra and imbalanced memory footprints, leaving room for further improvement. In this paper, we propose Mario, a pipeline optimizer that automatically tessellates activation checkpointing into existing pipeline schemes, enabling the training of larger models (or longer sequences) with a smaller, more balanced memory footprint across GPUs and improved GPU utilization. First, activation recomputation can be effectively overlapped with pipeline bubbles by moving it earlier in the execution, thereby improving overall efficiency. With the memory freed by checkpointing, Mario can prepose additional forward computation into the pipeline bubbles, creating more room and flexibility for further overlapping and thus better exploiting the bubbles. We then design a lightweight pipeline simulator to model execution behavior w/o|w/ Mario. Finally, we introduce an automatic pipeline scheduler tailored to Mario that searches for a near-optimal combination of checkpointing and pipeline configurations within minutes. Experimental results on GPT3 and LLaMA2 models show that Mario speeds up existing state-of-the-art pipeline schemes (w/o|w/ checkpointing), including 1F1B, Chimera, and Interleave, by 1.16x|1.57x on average. This work paves a new direction for effective, low-cost pipeline training.
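To make the overlap idea concrete, the sketch below is a minimal, self-contained toy simulator; it is not the paper's lightweight simulator or any of its schedules. It models a GPipe-style schedule (all forwards, then all backwards) with assumed per-microbatch costs f (forward), b (backward), and r (recomputation), and compares placing recomputation on the critical path right before each backward versus starting it early so the bubble spent waiting for the upstream gradient hides it. All names, parameters, and numbers are illustrative assumptions.

```python
def simulate(P, M, f, b, r, overlap):
    """Return the makespan of one iteration under a toy GPipe-style schedule."""
    # fwd_end[s][m]: time the forward pass of microbatch m finishes on stage s.
    fwd_end = [[0.0] * M for _ in range(P)]
    for s in range(P):
        for m in range(M):
            prev_mb = fwd_end[s][m - 1] if m > 0 else 0.0     # stage busy with previous microbatch
            prev_stage = fwd_end[s - 1][m] if s > 0 else 0.0  # activations from upstream stage
            fwd_end[s][m] = max(prev_mb, prev_stage) + f

    # bwd_end[s][m]: time the backward pass of microbatch m finishes on stage s.
    bwd_end = [[0.0] * M for _ in range(P)]
    for s in reversed(range(P)):            # gradients flow from the last stage backwards
        stage_free = fwd_end[s][M - 1]      # stage becomes idle after its last forward
        for m in range(M):
            grad_ready = bwd_end[s + 1][m] if s < P - 1 else stage_free
            if overlap:
                # Early recomputation: rebuild activations as soon as the stage is
                # idle, so waiting for the upstream gradient hides (part of) r.
                start = max(stage_free + r, grad_ready)
            else:
                # Vanilla checkpointing: recompute right before the backward,
                # squarely on the critical path.
                start = max(stage_free, grad_ready) + r
            bwd_end[s][m] = start + b
            stage_free = bwd_end[s][m]

    return max(row[M - 1] for row in bwd_end)


if __name__ == "__main__":
    P, M, f, b, r = 4, 8, 1.0, 2.0, 1.0     # assumed toy stage count and costs
    print("no checkpointing         :", simulate(P, M, f, b, 0.0, False))
    print("checkpointing, serial    :", simulate(P, M, f, b, r, False))
    print("checkpointing, overlapped:", simulate(P, M, f, b, r, True))
```

In this toy configuration, the non-final stages hide their recomputation in the idle gaps between incoming gradients, while the last stage, which has no bubble before its backwards, still pays the recomputation cost; the printed makespans reflect that partial saving. It is only meant to illustrate why preposing recomputation into bubbles can reduce the effective cost of checkpointing.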