Design Azure VM Scheduler (Bin Packing + Fault Domains)
VM allocation, bin packing, fault domains, SLA-aware placement. A signature Microsoft system design interview (SDI) question.
Theory
Explanation
Intuition first, formal definition second. Skim the bullets if you already know this; read the prose if you don't.
A large cloud runs 1M+ physical hosts. VMs come in shapes ranging from 1 vCPU/2 GB to 96 vCPU/768 GB. The scheduler packs VMs onto hosts to maximize utilization while honoring constraints: fault-domain spread, update-domain spread, anti-affinity, GPU and memory requirements, and region/AZ rules.
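To make the packing problem concrete, here is a minimal sketch of first-fit decreasing, one common bin-packing heuristic, applied to two-dimensional VM shapes. It ignores every constraint except capacity. The host size (128 vCPU/1024 GB), the `Host` class, and the function name are assumptions for illustration, not anything Azure documents.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    vcpus: int                                  # total vCPUs on the host
    mem_gb: int                                 # total memory in GB
    vms: list = field(default_factory=list)     # placed (vcpus, mem_gb) shapes

    def free(self):
        used_cpu = sum(v[0] for v in self.vms)
        used_mem = sum(v[1] for v in self.vms)
        return self.vcpus - used_cpu, self.mem_gb - used_mem

    def fits(self, shape):
        free_cpu, free_mem = self.free()
        return shape[0] <= free_cpu and shape[1] <= free_mem


def first_fit_decreasing(shapes, host_vcpus, host_mem_gb):
    """Pack (vcpus, mem_gb) shapes onto identical hosts, opening a new host only when needed."""
    hosts = []
    # Largest VMs first: big shapes are the hardest to fit into leftover gaps.
    for shape in sorted(shapes, reverse=True):
        if shape[0] > host_vcpus or shape[1] > host_mem_gb:
            raise ValueError(f"shape {shape} exceeds host capacity")
        # First fit: take the first already-open host with room on both dimensions.
        target = next((h for h in hosts if h.fits(shape)), None)
        if target is None:
            target = Host(host_vcpus, host_mem_gb)
            hosts.append(target)
        target.vms.append(shape)
    return hosts


# Hypothetical request mix, from the smallest to the largest shape mentioned above.
shapes = [(1, 2), (96, 768), (4, 16), (32, 128), (64, 256), (2, 8), (16, 64)]
hosts = first_fit_decreasing(shapes, host_vcpus=128, host_mem_gb=1024)
for i, h in enumerate(hosts):
    print(i, h.vms, "free:", h.free())
```

With this mix the heuristic fills two hosts. The first ends up with no free vCPUs but 128 GB of stranded memory, which is the kind of fragmentation a production scorer tries to minimize.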
The scheduler runs as part of the cluster manager and keeps an inventory of hosts and their free resources. Each incoming VM request specifies a shape plus constraints. Placement is a greedy bin-pack with constraint satisfaction: filter the feasible hosts → score them by utilization, fault-domain spread, and locality → place the VM on the best-scoring host. For HA sets, spread the VMs across N fault domains and M update domains. Preemption evicts lower-priority VMs to make room for higher-priority ones (e.g., when burst capacity is needed). Live migration moves VMs without a restart when hosts need maintenance.
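The filter → score → place loop can be sketched as follows. This is an illustrative model, not the real Azure allocator: the `Host` and `Request` classes, the `schedule` function, and the spread weight of 10 are all assumptions, and locality and update domains are left out to keep it short.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    fault_domain: int
    free_vcpus: int
    free_mem_gb: int
    total_vcpus: int


@dataclass
class Request:
    vcpus: int
    mem_gb: int
    ha_set: str   # VMs sharing an HA set should land in different fault domains


def schedule(req, hosts, placements):
    """Place one VM. `placements` maps ha_set -> fault domains already used by that set."""
    # 1. Filter: keep only hosts with enough free CPU and memory.
    feasible = [h for h in hosts
                if h.free_vcpus >= req.vcpus and h.free_mem_gb >= req.mem_gb]
    if not feasible:
        return None   # a fuller scheduler might try preemption here

    used = Counter(placements.get(req.ha_set, []))

    # 2. Score: prefer tight packing, but penalize reusing a fault domain.
    def score(h):
        util_after = 1 - (h.free_vcpus - req.vcpus) / h.total_vcpus   # in 0..1
        # Assumed weight of 10 makes spread always outrank utilization.
        return util_after - 10 * used[h.fault_domain]

    # 3. Place: commit resources on the best-scoring host.
    best = max(feasible, key=score)
    best.free_vcpus -= req.vcpus
    best.free_mem_gb -= req.mem_gb
    placements.setdefault(req.ha_set, []).append(best.fault_domain)
    return best


hosts = [
    Host("a0", fault_domain=0, free_vcpus=8, free_mem_gb=64, total_vcpus=64),
    Host("a1", fault_domain=0, free_vcpus=32, free_mem_gb=256, total_vcpus=64),
    Host("b0", fault_domain=1, free_vcpus=16, free_mem_gb=128, total_vcpus=64),
    Host("c0", fault_domain=2, free_vcpus=64, free_mem_gb=512, total_vcpus=64),
]
placements = {}
chosen = [schedule(Request(4, 16, "web"), hosts, placements).name for _ in range(3)]
print(chosen, placements)
```

The first VM goes to the fullest feasible host (`a0`), and the next two are pushed into fault domains 1 and 2 even though `a0` still has room. Making spread a heavy penalty rather than a hard filter means a fourth VM can still be placed once every fault domain is in use.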
When to use
Cloud compute schedulers, Kubernetes schedulers, GPU clusters.
When not to
Bare-metal only with no virtualization.
flowchart TB
  Req[VM Request · shape + constraints] --> Sched[Scheduler]
  Sched --> Inv[(Host Inventory)]
  Sched --> Filter[Filter feasible]
  Filter --> Score[Score · util + spread + locality]
  Score --> Place[Place on host]
  Place --> Host[Host]
  Host --> Hyper[Hypervisor · creates VM]
  Maint[Maintenance Event] --> LM[Live Migration]
  LM --> NewHost[Target Host]
Key insights
- Optimal bin packing under placement constraints is NP-hard, but greedy heuristics give good enough placements in practice.
- Fault domains protect against unplanned rack/PSU failure; update domains limit how many VMs in a set go down at once during rolling upgrades.
- Live migration uses pre-copy: memory is copied while the VM keeps running, dirtied pages are re-copied iteratively, and a brief stun-and-switch transfers the final state.
- Preemption supports spot instances, cheap capacity that can be reclaimed.
- Capacity planning sits on top of the scheduler: it forecasts demand to decide what capacity to provision next month.
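The pre-copy insight above can be made concrete with a small convergence model. This is a sketch under simplifying assumptions: the dirty rate and the network bandwidth are treated as constant, and the function and parameter names are hypothetical. Real hypervisors track dirtied pages individually and see workload-dependent dirty rates.

```python
def precopy_rounds(mem_gb, bandwidth_gb_per_s, dirty_gb_per_s,
                   max_downtime_s, max_rounds=30):
    """Return (rounds, precopy_time_s, downtime_s) for a constant-rate pre-copy model."""
    if dirty_gb_per_s >= bandwidth_gb_per_s:
        # Pages are dirtied faster than they can be sent, so rounds never shrink.
        raise ValueError("dirty rate >= bandwidth: pre-copy will not converge")

    to_send = mem_gb      # the first round sends all of memory
    elapsed = 0.0
    rounds = 0
    # Copy while the VM runs, until the remainder fits in the downtime budget.
    while to_send / bandwidth_gb_per_s > max_downtime_s and rounds < max_rounds:
        round_time = to_send / bandwidth_gb_per_s
        elapsed += round_time
        # Memory dirtied during this round has to be sent again in the next one.
        to_send = dirty_gb_per_s * round_time
        rounds += 1

    # Stun-and-switch: the VM is paused only while the last remainder is copied.
    downtime = to_send / bandwidth_gb_per_s
    return rounds, elapsed, downtime


# Hypothetical numbers: 64 GB VM, 4 GB/s link, 1 GB/s dirty rate, 0.3 s downtime budget.
result = precopy_rounds(64, 4, 1, 0.3)
print(result)  # → (3, 21.0, 0.25)
```

Each round shrinks the remaining data by the ratio of dirty rate to bandwidth, so migration converges quickly when the link is much faster than the workload dirties memory, and not at all when it is slower.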