system design · system-design · domain
Design Distributed Training for 10B+ Param Model on 128 H100/B200
Tensor + pipeline + data parallelism, NCCL all-reduce, FP8/BF16, ZeRO/FSDP, checkpointing. NVIDIA signature SDI.
Theory
Explanation
Intuition first, formal definition second. Skim the bullets if you already know this; read the prose if you don't.
10B params do not fit on one H100 (80GB). Split the model: tensor parallel splits each layer across GPUs; pipeline parallel pipelines layers across GPU groups; data parallel replicates the whole setup and averages gradients. Stack all three to fit and run efficiently on 128 GPUs.
3D parallelism: TP×PP×DP = total GPUs. Tensor parallel: each transformer matmul split (e.g., 8-way), needs all-reduce per layer (NCCL nvlink). Pipeline parallel: model stages on separate GPU groups, micro-batch pipeline avoids bubble. Data parallel: replicate across additional GPU sets, all-reduce gradients per step. ZeRO/FSDP shards optimizer states + gradients across data-parallel ranks to fit larger models. Mixed precision (FP8 on Hopper) doubles throughput. Checkpointing every N steps to recover from failures.
When to use
Foundation model training, multi-billion-parameter fine-tuning.
When not to
Models fitting on single GPU. Inference (different optimization).
flowchart TB
subgraph TP · 8-way Tensor Parallel
G1[GPU 1] <-->|nvlink| G2[GPU 2]
G2 <-->|nvlink| G3[GPU 3]
G3 <-->|nvlink| G4[GPU 4]
G4 <-->|nvlink| G5[GPU 5]
G5 <-->|nvlink| G6[GPU 6]
G6 <-->|nvlink| G7[GPU 7]
G7 <-->|nvlink| G8[GPU 8]
end
Stage1[Pipeline Stage 1 · 8 GPUs] --> Stage2[Stage 2 · 8 GPUs]
Stage2 --> Stage3[Stage 3 · 8 GPUs]
Stage3 --> Stage4[Stage 4 · 8 GPUs]
DP1[DP Replica 1 · 32 GPUs] -.NCCL all-reduce gradients.- DP2[DP Replica 2 · 32 GPUs]
DP2 -.NCCL.- DP3[DP Replica 3]
DP3 -.NCCL.- DP4[DP Replica 4]
Checkpoint[(Checkpoint Store · async)] -.flush every N steps.- DP1Key insights
- TP within node (nvlink 900GB/s); PP between nodes (slower interconnect); DP across additional sets.
- FSDP/ZeRO-3 shards even parameters across data-parallel ranks, memory scales linearly with DP size.
- Pipeline bubble = idle time at pipeline start/end; minimize with micro-batches.
- Checkpoint async on background stream; never block training step.
- NCCL all-reduce is bandwidth-bound, topology-aware (ring vs tree) matters.