
Design a GPU Cluster Scheduler (MIG + Gang)

Bin packing, MIG slicing, gang scheduling, the Kubernetes GPU Operator, preemption, and QoS. A signature NVIDIA system design interview (SDI) topic.

hard · 4h · cuda · kubernetes · system-design

Theory

Explanation

Intuition first, formal definition second. Skim the bullets if you already know this; read the prose if you don't.

Distributed training jobs need all of their GPUs at the same time (gang scheduling); inference jobs often need only a fraction of a GPU (MIG slicing). The scheduler has to honor both, pack jobs tightly, and take topology (NVLink islands) into account.

The scheduler tracks the GPU inventory of each host, including the MIG profiles each GPU supports. A request specifies N GPUs, topology hints (same node? same NVLink island?), and a priority. Gang scheduling holds a job until all N GPUs can be co-scheduled. Preemption: low-priority inference jobs are evicted to make room for high-priority training. Topology awareness: prefer GPUs on the same NVLink island for tensor-parallel workloads. MIG: slice one H100 into up to 7 isolated GPU instances so inference tenants get a fair, predictable share.
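The MIG side of the inventory can be illustrated with a small packing sketch. It assumes an 80 GB H100 with 7 compute slices and uses first-fit decreasing as the packing heuristic. The `Gpu` class and the `pack_mig` function are hypothetical names for illustration, and the model deliberately ignores the fixed placement positions that real MIG imposes on each profile.

```python
from dataclasses import dataclass, field

# (compute slices, memory in GB) per MIG profile on an 80 GB H100.
# Simplification: real MIG also restricts where on the GPU a profile may start.
PROFILES = {
    "1g.10gb": (1, 10),
    "2g.20gb": (2, 20),
    "3g.40gb": (3, 40),
    "4g.40gb": (4, 40),
    "7g.80gb": (7, 80),
}


@dataclass
class Gpu:
    name: str
    free_slices: int = 7
    free_mem_gb: int = 80
    instances: list = field(default_factory=list)

    def try_place(self, profile: str) -> bool:
        """Reserve the profile on this GPU if slices and memory both fit."""
        slices, mem = PROFILES[profile]
        if slices <= self.free_slices and mem <= self.free_mem_gb:
            self.free_slices -= slices
            self.free_mem_gb -= mem
            self.instances.append(profile)
            return True
        return False


def pack_mig(requests, gpus):
    """First-fit decreasing: largest profiles first, first GPU that fits.

    Returns the list of profiles that could not be placed.
    """
    unplaced = []
    for profile in sorted(requests, key=lambda p: PROFILES[p][0], reverse=True):
        if not any(gpu.try_place(profile) for gpu in gpus):
            unplaced.append(profile)
    return unplaced
```

Placing large profiles first tends to leave fewer stranded slices than placing requests in arrival order, which is why the sketch sorts before packing.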

When to use

Internal ML platforms, cloud GPU services, HPC clusters.

When not to

Single-tenant single-GPU workloads.

flowchart TB
  Req[Job · 8 GPUs same NVLink + gang] --> Sched[GPU Scheduler]
  Sched --> Inv[(GPU Inventory · per host + MIG)]
  Sched --> Filter[Topology + MIG filter]
  Filter --> Score[Score · packing + locality]
  Score --> Gang{All slots available?}
  Gang -->|yes| Place[Place job]
  Gang -->|no| Wait[Wait / backfill smaller]
  Preempt[High-pri arrives] --> Evict[Evict low-pri]
  Evict --> Sched
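
The filter, score, and gang-check stages in the flowchart above could be sketched roughly as follows. This is a minimal illustration, not a production scheduler: `Host`, `Job`, the `island` label, and the packing score are all assumptions made for the example, and a real system would also track individual GPU IDs and MIG state.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    island: str            # hypothetical NVLink island label
    free_gpus: int
    total_gpus: int = 8


@dataclass
class Job:
    name: str
    gpus: int
    same_node: bool = False  # topology hint: all GPUs on one host


def filter_hosts(job, hosts):
    """Filter stage: drop hosts that cannot contribute to this job."""
    need = job.gpus if job.same_node else 1
    return [h for h in hosts if h.free_gpus >= need]


def score(host):
    """Score stage: prefer hosts that are already busy (tight packing)."""
    return (host.total_gpus - host.free_gpus) / host.total_gpus


def place_gang(job, hosts):
    """Gang check: return {host name: GPU count} covering the whole job, or None.

    Nothing is committed unless every GPU of the job can be placed.
    """
    # Group candidates by island so a multi-host job stays on one NVLink island.
    islands = {}
    for h in filter_hosts(job, hosts):
        islands.setdefault(h.island, []).append(h)

    for island_hosts in islands.values():
        ranked = sorted(island_hosts, key=score, reverse=True)
        plan, remaining = {}, job.gpus
        for h in ranked:
            take = min(h.free_gpus, remaining)
            plan[h.name] = take
            remaining -= take
            if remaining == 0:
                break
        if remaining == 0:
            # All slots available: commit the whole plan at once.
            for h in ranked:
                h.free_gpus -= plan.get(h.name, 0)
            return plan
    return None  # caller keeps the job queued (wait / backfill)
```

Building the full plan before touching `free_gpus` is what makes the placement all-or-nothing: a job that cannot be fully satisfied leaves the inventory unchanged.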

Key insights

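Gang scheduling is all-or-nothing, which avoids the deadlock where several jobs each hold a partial allocation and none can start.
MIG turns one H100 into up to 7 isolated GPU instances, which helps inference workloads meet their SLAs.
Topology hints matter: a TP=8 job should land on a single NVLink island, otherwise it can run on the order of 10x slower.
Backfill: small jobs fill the gaps while a large gang job waits, which keeps utilization high.
Preemption depends on checkpointing: only jobs that checkpoint regularly survive eviction without losing most of their progress.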
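
The backfill and preemption points above can be sketched in a few lines. This is a simplified illustration under stated assumptions: `Job`, `backfill`, and `pick_victims` are hypothetical names, GPUs are treated as one flat pool with no topology, and a higher `priority` number means a more important job.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    gpus: int
    priority: int              # higher number = more important
    checkpoints: bool = False  # True if the job checkpoints regularly


def backfill(queue, free_gpus):
    """Walk the queue in order; jobs that do not fit are skipped, not blocking.

    Returns (started jobs, still-waiting jobs, GPUs left free).
    """
    started, waiting = [], []
    for job in queue:
        if job.gpus <= free_gpus:
            free_gpus -= job.gpus
            started.append(job)
        else:
            waiting.append(job)
    return started, waiting, free_gpus


def pick_victims(running, incoming, free_gpus):
    """Pick lower-priority jobs to evict so `incoming` fits, or return None.

    All-or-nothing: no victims are returned unless eviction frees enough GPUs.
    """
    needed = incoming.gpus - free_gpus
    if needed <= 0:
        return []
    # Lowest priority first; among equals, prefer jobs that checkpoint,
    # because they lose the least work when evicted.
    candidates = sorted(
        (j for j in running if j.priority < incoming.priority),
        key=lambda j: (j.priority, not j.checkpoints),
    )
    victims = []
    for job in candidates:
        victims.append(job)
        needed -= job.gpus
        if needed <= 0:
            return victims
    return None
```

Note that this naive backfill can starve the large waiting job if small jobs keep arriving; schedulers usually pair backfill with a reservation for the head-of-queue job so that it eventually starts.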