Design Amazon Order Processing (Saga + Idempotency)
Saga pattern across cart → payment → inventory → fulfillment. Tests distributed-transaction handling without 2PC.
Theory
Explanation
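A single order touches 4-6 services. Distributed two-phase commit (2PC) across them is impractical at scale: it holds locks across services and blocks if the coordinator fails. A saga instead runs a sequence of local transactions, each paired with a compensating action. If any step fails, the compensations for the steps already completed run in reverse order to restore consistency.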
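The reverse-order compensation described above can be sketched in a few lines. This is a minimal in-process illustration, not a production orchestrator: `SagaStep`, `run_saga`, and the step names are hypothetical, and the lambdas stand in for remote calls to real services.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]      # local transaction on one service
    compensate: Callable[[], None]  # semantic undo of that transaction


def run_saga(steps: List[SagaStep]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
            completed.append(step)
        except Exception:
            # The failed step never completed, so only earlier steps are undone.
            for done in reversed(completed):
                done.compensate()
            return False
    return True


log: List[str] = []


def fail_shipment() -> None:
    # Simulated failure of the third step (assumption for the demo).
    raise RuntimeError("carrier unavailable")


steps = [
    SagaStep("inventory", lambda: log.append("reserve"), lambda: log.append("release")),
    SagaStep("payment", lambda: log.append("charge"), lambda: log.append("refund")),
    SagaStep("shipment", fail_shipment, lambda: log.append("cancel_shipment")),
]

ok = run_saga(steps)
print(ok, log)  # False ['reserve', 'charge', 'refund', 'release']
```

A real orchestrator would also persist `completed` after every step so that compensation can resume after a crash; this sketch keeps it in memory only.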
The saga orchestrator holds the order state machine: CREATED → INVENTORY_RESERVED → PAYMENT_CHARGED → SHIPMENT_CREATED → COMPLETED. Each step is a local transaction on its own service. A failure triggers the compensations for the steps already completed, for example SHIPMENT_FAILED → refund payment → release inventory → cancel order. Idempotency keys on every endpoint (one UUID per order step) prevent duplicate effects when a call is retried. The outbox pattern writes each event in the same DB transaction as the state change, so an event is never lost after a committed write.
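To make the idempotency-key idea concrete, here is a hedged sketch of a payment step that stores each key next to its response. The `charge` function, the `idempotency` and `charges` tables, and the response format are all assumptions for illustration; an in-memory SQLite database stands in for the payment service's own store.

```python
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE idempotency (key TEXT PRIMARY KEY, response TEXT NOT NULL)")
db.execute("CREATE TABLE charges (order_id TEXT, amount_cents INTEGER)")


def charge(key: str, order_id: str, amount_cents: int) -> str:
    """Charge at most once per idempotency key; replays return the stored response."""
    row = db.execute(
        "SELECT response FROM idempotency WHERE key = ?", (key,)
    ).fetchone()
    if row is not None:
        return row[0]  # retry: no second charge, same answer as the first call
    response = f"paid:{order_id}:{amount_cents}"
    with db:  # one transaction: the charge and its key commit or roll back together
        db.execute("INSERT INTO charges VALUES (?, ?)", (order_id, amount_cents))
        db.execute("INSERT INTO idempotency VALUES (?, ?)", (key, response))
    return response


key = str(uuid.uuid4())  # the orchestrator would generate one key per order step
first = charge(key, "order-1", 4999)
retry = charge(key, "order-1", 4999)  # e.g. the first reply was lost to a timeout
charge_count = db.execute("SELECT COUNT(*) FROM charges").fetchone()[0]
print(first == retry, charge_count)  # True 1
```

The primary key on `key` also covers two concurrent requests with the same key: the second insert fails and its transaction rolls back, so only one charge row survives.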
When to use
Any multi-service business transaction: orders, refunds, account onboarding, ride dispatch.
When not to
Pure read flows. Single-DB transactions (use native ACID).
sequenceDiagram
participant C as Client
participant O as Order Saga
participant I as Inventory
participant P as Payment
participant S as Shipment
C->>O: POST /orders (idempotency-key)
O->>I: reserve(order_id) [step 1]
I-->>O: reserved · TTL 15min
O->>P: charge(order_id, amount) [step 2]
P-->>O: paid
O->>S: create_shipment(order_id) [step 3]
alt happy path
S-->>O: shipped
O-->>C: 200 confirmed
else step 3 fails
O->>P: refund(order_id)
O->>I: release(order_id)
O-->>C: 500 + reason
end
Key insights
- Orchestration (central coordinator) vs. choreography (event chain): orchestration is easier to reason about; choreography has lower coupling. Pick based on team familiarity.
- Compensation is not always a perfect inverse: a refund is not the same as un-charging. Document the semantic differences.
- Idempotency keys must be persisted on the server, not just trusted from the client.
- Outbox pattern: same DB transaction inserts business row + outbox row; relay reads outbox → publishes to Kafka. Prevents lost events.
- Sagas can run for hours (e.g. payment auth → capture delay). State machine must survive process restart.
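The outbox bullet above can be illustrated with a small sketch. It assumes an in-memory SQLite database in place of the order service's DB and a plain callback in place of a Kafka producer; `create_order`, `relay_once`, and the table layout are hypothetical names chosen for the example.

```python
import json
import sqlite3
from typing import Callable, List

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
db.execute(
    "CREATE TABLE outbox ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "payload TEXT NOT NULL, "
    "published INTEGER NOT NULL DEFAULT 0)"
)


def create_order(order_id: str) -> None:
    """Insert the business row and its event in a single transaction."""
    event = {"type": "OrderCreated", "order_id": order_id}
    with db:  # both rows commit, or neither does
        db.execute("INSERT INTO orders VALUES (?, 'CREATED')", (order_id,))
        db.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def relay_once(publish: Callable[[dict], None]) -> int:
    """Publish unpublished outbox rows in insertion order; return how many were sent."""
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))  # a crash after this line causes a re-send later
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


sent: List[dict] = []  # stand-in for the message broker
create_order("order-1")
first_pass = relay_once(sent.append)
second_pass = relay_once(sent.append)
print(first_pass, second_pass, sent)
```

Because the relay marks a row as published only after sending it, delivery is at-least-once, which is one reason consumers still need idempotency keys.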