system design · system-design · domain

Design Fault-Tolerant Distributed Training Pipeline

Checkpointing cadence, replica resurrection, NCCL recovery, elastic training. Survival in long-running jobs.

expert4hcudaml-aisystem-design

Theory

Explanation

Intuition first, formal definition second. Skim the bullets if you already know this; read the prose if you don't.

Training a 70B model on 1024 GPUs for 30 days = guaranteed failures. Hardware fails, NCCL hangs, network partitions occur. The pipeline must detect, recover, and continue without losing days of work.

Async checkpointing every N steps writes shard per rank to parallel filesystem. NCCL watchdog detects hangs (any rank stuck >5min). On failure: scheduler removes failed node, training shrinks to surviving nodes (elastic) or pauses for replacement. Resume from latest checkpoint. Optimizer state must be reshardable when DP size changes. Determinism + replayability for reproducible debugging.

When to use

Multi-day training runs at scale.

When not to

Short jobs (< 1 day), recovery overhead exceeds expected failures.

flowchart TB
  Training[Training Loop] --> Checkpoint{Every N steps}
  Checkpoint -->|async| PFS[(Parallel Filesystem · sharded ckpt)]
  Training --> Watch[NCCL Watchdog]
  Watch -->|hang| Detect[Detect failed rank]
  Detect --> Sched[GPU Scheduler]
  Sched --> Elastic{Elastic or replace?}
  Elastic -->|elastic| Resize[Reshard optimizer · resume]
  Elastic -->|replace| NewNode[Provision new node]
  NewNode --> Resume[Resume from checkpoint]
  Resize --> Resume

Key insights

Async checkpoint via copy-to-pinned-memory + background flush; never block training.
Sharded checkpoint: each rank writes its own state, parallel I/O.
NCCL watchdog catches silent hangs; without it, GPUs idle forever.
Elastic training reshardable when DP changes, optimizer state aware of rank count.
Determinism via fixed seeds + deterministic kernels; bisects training divergence bugs.