Design Amazon Recommendation System
Candidate generation → ranking → re-ranking pipeline. Feature stores, A/B testing, real-time inference. A signature Amazon system design interview (SDI) question.
Theory
Explanation
Intuition first, formal definition second. Skim the bullets if you already know this; read the prose if you don't.
Out of 100M products, surface the 20 a user is most likely to buy. Scoring all 100M items with a full ranking model is infeasible at p99 < 200 ms. Hence a two-stage funnel plus a policy pass: cheap retrieval narrows the catalog to ~1000 candidates, an expensive ranker scores them, and a lightweight re-ranker enforces business rules.
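A back-of-the-envelope check makes the latency argument concrete. Only the catalog size, the candidate count, and the 200 ms budget come from the text above; the per-item scoring cost is an assumed round number for illustration, not a measurement.

```python
# Rough latency arithmetic for the funnel.
CATALOG_SIZE = 100_000_000   # from the text: 100M products
CANDIDATES = 1_000           # from the text: ~1000 after retrieval
BUDGET_MS = 200              # from the text: p99 < 200 ms
PER_ITEM_COST_US = 10        # ASSUMPTION: 10 microseconds per scored (user, item) pair


def ranking_time_ms(num_items: int, cost_us: float = PER_ITEM_COST_US) -> float:
    """Sequential time, in milliseconds, to score num_items with the ranker."""
    return num_items * cost_us / 1_000


full_scan_ms = ranking_time_ms(CATALOG_SIZE)   # rank the whole catalog
funnel_ms = ranking_time_ms(CANDIDATES)        # rank only retrieved candidates
reduction = CATALOG_SIZE // CANDIDATES         # fewer items reach the ranker

print(f"rank everything: {full_scan_ms:,.0f} ms (budget {BUDGET_MS} ms)")
print(f"rank candidates: {funnel_ms:,.0f} ms")
print(f"items scored by the ranker drop by {reduction:,}x")
```

Under this assumed cost, a full scan overshoots the budget by several orders of magnitude, while ranking ~1000 candidates leaves most of the budget for retrieval, feature lookup, and network overhead.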
- Stage 1, Retrieval (two-tower neural net): user embedding × item embedding via an ANN index (FAISS/ScaNN).
- Stage 2, Ranking (gradient-boosted trees or DLRM): fully featured per-pair scoring.
- Stage 3, Re-ranking (rule-based): diversity, freshness, exclusion of already-bought items.
- Feature store (online + offline parity): feeds both training and inference.
- Logging: emits impressions + clicks, which train the next model.
- A/B framework: allocates traffic and tracks guardrail metrics.
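Stages 1 and 2 can be sketched in a few lines of NumPy. This is a toy under stated assumptions, not a production design: the embeddings are random stand-ins for the output of trained user and item towers, exact inner-product search stands in for an ANN index such as FAISS or ScaNN, and the `rank` scorer (similarity blended with a made-up popularity feature) is a hypothetical placeholder for a GBDT or DLRM model.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_ITEMS, DIM = 10_000, 32   # toy catalog; the text's catalog is 100M items

# ASSUMPTION: random vectors stand in for trained item-tower embeddings.
item_emb = rng.normal(size=(NUM_ITEMS, DIM)).astype(np.float32)
item_emb /= np.linalg.norm(item_emb, axis=1, keepdims=True)

# ASSUMPTION: a made-up per-item feature the ranker can see but retrieval cannot.
item_popularity = rng.random(NUM_ITEMS)


def retrieve(user_vec: np.ndarray, k: int) -> np.ndarray:
    """Stage 1: ids of the k items with the largest inner product.

    Exact brute-force search; an ANN index would approximate this result.
    """
    scores = item_emb @ user_vec
    top = np.argpartition(-scores, k - 1)[:k]    # unordered top-k ids
    return top[np.argsort(-scores[top])]         # best first


def rank(user_vec: np.ndarray, cand_ids: np.ndarray, n: int) -> np.ndarray:
    """Stage 2: hypothetical per-pair scorer applied only to the candidates."""
    similarity = item_emb[cand_ids] @ user_vec
    score = 0.8 * similarity + 0.2 * item_popularity[cand_ids]
    return cand_ids[np.argsort(-score)[:n]]


# ASSUMPTION: a random vector stands in for the user-tower output.
user_emb = rng.normal(size=DIM).astype(np.float32)
user_emb /= np.linalg.norm(user_emb)

candidates = retrieve(user_emb, k=1_000)    # cheap: one matrix-vector product
top100 = rank(user_emb, candidates, n=100)  # expensive model sees 1,000 items only
print(len(candidates), len(top100))
```

The point of the split is that `rank` never touches items outside `candidates`, so its per-pair cost can grow (more features, a deeper model) without scaling with catalog size.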
When to use
Any catalog-scale recommendation: shopping, video, news, music, dating.
When not to
Catalogs under ~100 items, where manual curation wins. Strict explainability requirements, since neural models are opaque.
flowchart LR
User([User]) --> API[Reco API]
API --> Retrieval{{Two-Tower ANN}}
Retrieval --> Cands[~1000 candidates]
Cands --> Ranker{{GBDT / DLRM}}
Ranker --> Top[Top 100]
Top --> Rerank[Diversity + Rules]
Rerank --> Response([20 items])
FS[(Feature Store)] -.online.-> Ranker
Logs[[Impressions + Clicks]] --> Train[Offline Training]
Train --> Model[(Model Registry)]
Model -.deploy.-> Ranker
Key insights
- The two-stage funnel cuts the number of items the expensive ranker must score by ~100,000x (100M → ~1000) while retaining most of the ranking quality.
- Feature store online/offline parity prevents train-serve skew, one of the most common silent bugs.
- A/B tests must control for novelty effects; build in observation windows of ≥7 days.
- Cold-start users are handled with content-based features (category, brand) until interaction signal accumulates.
- Re-ranking is where business policy lives (no adult items shown to minors, no expired listings, regional rules).
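The re-ranking bullet above can be sketched as a single pass over the ranker's output. Everything here is illustrative: the `Item` fields, the `rerank` function, and the per-category cap used as a diversity rule are hypothetical names and choices, not taken from any real Amazon system.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    # Hypothetical item record; real catalogs carry many more policy fields.
    item_id: str
    category: str
    adult: bool = False
    expired: bool = False


def rerank(
    ranked: list[Item],
    *,
    user_is_minor: bool,
    purchased_ids: set[str],
    max_per_category: int = 2,
    n: int = 20,
) -> list[Item]:
    """Apply hard policy filters, then a per-category diversity cap.

    The ranker's order is preserved among the items that survive.
    """
    out: list[Item] = []
    per_category: dict[str, int] = {}
    for item in ranked:
        if item.expired or item.item_id in purchased_ids:
            continue                      # hard filter: expired or already bought
        if item.adult and user_is_minor:
            continue                      # hard filter: policy rule
        if per_category.get(item.category, 0) >= max_per_category:
            continue                      # diversity: cap items per category
        per_category[item.category] = per_category.get(item.category, 0) + 1
        out.append(item)
        if len(out) == n:
            break
    return out


ranked = [
    Item("a1", "books"), Item("a2", "books"), Item("a3", "books"),
    Item("b1", "wine", adult=True), Item("c1", "toys", expired=True),
    Item("d1", "toys"), Item("e1", "games"),
]
result = rerank(ranked, user_is_minor=True, purchased_ids={"a2"}, n=3)
print([item.item_id for item in result])
```

Keeping these rules outside the learned model means a policy change is a code or config change rather than a retraining job, which is one reason the rules sit in a separate stage.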