system design · system-design · domain

Design Distributed Training for 10B+ Param Model on 128 H100/B200

Tensor + pipeline + data parallelism, NCCL all-reduce, FP8/BF16, ZeRO/FSDP, checkpointing. NVIDIA signature SDI.

expert6hcppcudaml-aisystem-design

Theory

Explanation

Intuition first, formal definition second. Skim the bullets if you already know this; read the prose if you don't.

10B params do not fit on one H100 (80GB). Split the model: tensor parallel splits each layer across GPUs; pipeline parallel pipelines layers across GPU groups; data parallel replicates the whole setup and averages gradients. Stack all three to fit and run efficiently on 128 GPUs.

3D parallelism: TP×PP×DP = total GPUs. Tensor parallel: each transformer matmul split (e.g., 8-way), needs all-reduce per layer (NCCL nvlink). Pipeline parallel: model stages on separate GPU groups, micro-batch pipeline avoids bubble. Data parallel: replicate across additional GPU sets, all-reduce gradients per step. ZeRO/FSDP shards optimizer states + gradients across data-parallel ranks to fit larger models. Mixed precision (FP8 on Hopper) doubles throughput. Checkpointing every N steps to recover from failures.

When to use

Foundation model training, multi-billion-parameter fine-tuning.

When not to

Models fitting on single GPU. Inference (different optimization).

flowchart TB
  subgraph TP · 8-way Tensor Parallel
    G1[GPU 1] <-->|nvlink| G2[GPU 2]
    G2 <-->|nvlink| G3[GPU 3]
    G3 <-->|nvlink| G4[GPU 4]
    G4 <-->|nvlink| G5[GPU 5]
    G5 <-->|nvlink| G6[GPU 6]
    G6 <-->|nvlink| G7[GPU 7]
    G7 <-->|nvlink| G8[GPU 8]
  end
  Stage1[Pipeline Stage 1 · 8 GPUs] --> Stage2[Stage 2 · 8 GPUs]
  Stage2 --> Stage3[Stage 3 · 8 GPUs]
  Stage3 --> Stage4[Stage 4 · 8 GPUs]
  DP1[DP Replica 1 · 32 GPUs] -.NCCL all-reduce gradients.- DP2[DP Replica 2 · 32 GPUs]
  DP2 -.NCCL.- DP3[DP Replica 3]
  DP3 -.NCCL.- DP4[DP Replica 4]
  Checkpoint[(Checkpoint Store · async)] -.flush every N steps.- DP1

Key insights

TP within node (nvlink 900GB/s); PP between nodes (slower interconnect); DP across additional sets.
FSDP/ZeRO-3 shards even parameters across data-parallel ranks, memory scales linearly with DP size.
Pipeline bubble = idle time at pipeline start/end; minimize with micro-batches.
Checkpoint async on background stream; never block training step.
NCCL all-reduce is bandwidth-bound, topology-aware (ring vs tree) matters.