system design · system-design
Design a Distributed File System (GFS / Colossus)
Master + chunkservers, replication, append-only writes, fault recovery. Foundation of modern data infra.
Theory
Explanation
Intuition first, formal definition second. Skim the bullets if you already know this; read the prose if you don't.
Files broken into fixed-size chunks (64MB). Master holds metadata; chunkservers hold data. Replication for durability; append-only writes for simplicity at scale; lease-based primary chunkserver coordinates writes.
Master node: namespace + file→chunks→replicas mapping. Chunkservers: hold 64MB chunks on local disk. Reads go directly to chunkserver (master only serves locations). Writes: client gets primary replica from master, primary orders mutations, secondaries follow. Replication factor 3 default. Heartbeats from chunkservers; master re-replicates on detected loss. Colossus (successor) replaces single master with sharded metadata + Reed-Solomon erasure coding for storage efficiency.
When to use
Foundation for compute frameworks (MapReduce, Spark), data lakes, log archival.
When not to
Small files (master metadata bloat). Low-latency random writes, use NewSQL.
flowchart TB Client([Client]) -->|metadata request| Master[Master · namespace + chunk map] Client -->|read chunk| CS1[Chunkserver 1] Client -->|write chunk| Primary[Primary Replica] Primary --> Sec1[Secondary 1] Primary --> Sec2[Secondary 2] CS1 -.heartbeat.-> Master Sec1 -.heartbeat.-> Master Sec2 -.heartbeat.-> Master Master --> Repair[Re-replication on failure]
Key insights
- Append-only avoids the hardest writes: in-place updates. Most workloads (logs, MapReduce intermediates) accept this.
- Master metadata fits in RAM if average chunk is 64MB, 1PB cluster = 16M chunks × ~64B metadata = 1GB.
- Replication for durability; erasure coding (Reed-Solomon) replaces it in Colossus for storage efficiency.
- Lease-based primary avoids needing consensus per write, primary holds 60s lease.
- Single master is the historical bottleneck; Colossus shards metadata via consistent hashing.