system design · system-design · domain
Design Fault-Tolerant Distributed Training Pipeline
Checkpointing cadence, replica resurrection, NCCL recovery, elastic training. Survival in long-running jobs.
Theory
Explanation
Intuition first, formal definition second. Skim the bullets if you already know this; read the prose if you don't.
Training a 70B model on 1024 GPUs for 30 days = guaranteed failures. Hardware fails, NCCL hangs, network partitions occur. The pipeline must detect, recover, and continue without losing days of work.
Async checkpointing every N steps writes shard per rank to parallel filesystem. NCCL watchdog detects hangs (any rank stuck >5min). On failure: scheduler removes failed node, training shrinks to surviving nodes (elastic) or pauses for replacement. Resume from latest checkpoint. Optimizer state must be reshardable when DP size changes. Determinism + replayability for reproducible debugging.
When to use
Multi-day training runs at scale.
When not to
Short jobs (< 1 day), recovery overhead exceeds expected failures.
flowchart TB
Training[Training Loop] --> Checkpoint{Every N steps}
Checkpoint -->|async| PFS[(Parallel Filesystem · sharded ckpt)]
Training --> Watch[NCCL Watchdog]
Watch -->|hang| Detect[Detect failed rank]
Detect --> Sched[GPU Scheduler]
Sched --> Elastic{Elastic or replace?}
Elastic -->|elastic| Resize[Reshard optimizer · resume]
Elastic -->|replace| NewNode[Provision new node]
NewNode --> Resume[Resume from checkpoint]
Resize --> ResumeKey insights
- Async checkpoint via copy-to-pinned-memory + background flush; never block training.
- Sharded checkpoint: each rank writes its own state, parallel I/O.
- NCCL watchdog catches silent hangs; without it, GPUs idle forever.
- Elastic training reshardable when DP changes, optimizer state aware of rank count.
- Determinism via fixed seeds + deterministic kernels; bisects training divergence bugs.