Design Google Drive / Dropbox
File chunking, deduplication, sync, version history, and sharing/permissions. A classic system design interview (SDI) question, asked at all Mag7 companies.
Theory
Explanation
Intuition first, formal definition second. Skim the bullets if you already know this; read the prose if you don't.
A 1 KB change to a 1 GB file should sync only the chunk that changed, not the whole file. Content-defined chunking plus block-level deduplication achieves that, and it also saves storage when many users upload the same file.
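The claim above can be checked at small scale. The following is a minimal sketch, assuming a simple polynomial rolling hash rather than a true Rabin fingerprint; the window size, mask, and chunk-size limits are illustrative values chosen for the demo, not production settings. It chunks a random buffer, inserts one byte in the middle, and counts how many chunk hashes are new.

```python
import hashlib
import random

# Illustrative parameters (assumptions, far smaller than production values).
WINDOW = 16            # bytes covered by the rolling hash
MASK = (1 << 6) - 1    # cut when the low 6 bits are zero (roughly every 64 bytes)
MIN_CHUNK = 16         # never cut a chunk shorter than this
MAX_CHUNK = 256        # always cut once a chunk reaches this length
BASE = 257
MOD = (1 << 31) - 1
POW = pow(BASE, WINDOW - 1, MOD)  # weight of the byte that leaves the window


def chunk(data: bytes) -> list[bytes]:
    """Split data at positions chosen by the content of the last WINDOW bytes."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Slide the window: drop the contribution of the oldest byte.
            h = (h - data[i - WINDOW] * POW) % MOD
        h = (h * BASE + byte) % MOD
        length = i + 1 - start
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def chunk_hashes(data: bytes) -> list[str]:
    """Content address of every chunk, in file order."""
    return [hashlib.sha256(c).hexdigest() for c in chunk(data)]


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20_000))
edited = original[:10_000] + b"!" + original[10_000:]  # one-byte insertion

old_hashes = set(chunk_hashes(original))
new_hashes = chunk_hashes(edited)
to_upload = [h for h in new_hashes if h not in old_hashes]
print(f"{len(new_hashes)} chunks after the edit, {len(to_upload)} need uploading")
```

With fixed-size chunks, the same one-byte insertion would shift every later boundary and change every chunk after the edit. Here the boundaries depend only on nearby content, so they realign shortly after the edit point.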
The client splits each file into variable-size chunks using a rolling hash (Rabin fingerprint), then hashes each chunk with a cryptographic hash. It sends the manifest (the ordered list of chunk hashes) to the server, the server replies with the chunks it lacks, and the client uploads only those. File metadata (path, permissions, version) lives in a NewSQL database such as Spanner; blocks live in an object store. Sync: each client holds a local manifest, learns of a newer server version (by push notification or polling), fetches the new manifest, downloads the blocks it is missing, and rebuilds the local file from them. Sharing is an ACL on the metadata; permissions are checked on every API call.
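The upload handshake described above can be sketched with in-memory stand-ins. This is an illustration under stated assumptions, not a real service API: `BlockStore`, `MetadataStore`, and `upload` are hypothetical names, the chunks are pre-split by hand instead of by a rolling hash, and SHA-256 is assumed as the chunk hash.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class BlockStore:
    """Hypothetical stand-in for the object store: immutable, content-addressed blocks."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}

    def missing(self, hashes: list[str]) -> list[str]:
        return [h for h in hashes if h not in self.blocks]

    def put(self, block_hash: str, data: bytes) -> None:
        # Never trust the client's claimed hash; recompute it server-side.
        if sha256_hex(data) != block_hash:
            raise ValueError("block does not match its hash")
        self.blocks[block_hash] = data


class MetadataStore:
    """Hypothetical stand-in for the metadata DB: path -> list of manifests (versions)."""

    def __init__(self) -> None:
        self.versions: dict[str, list[list[str]]] = {}

    def commit(self, path: str, manifest: list[str]) -> int:
        self.versions.setdefault(path, []).append(list(manifest))
        return len(self.versions[path])  # new version number


def upload(path: str, chunks: list[bytes], blocks: BlockStore, meta: MetadataStore) -> int:
    """Send the manifest, upload only what the server lacks, then commit. Returns chunks sent."""
    manifest = [sha256_hex(c) for c in chunks]
    needed = set(blocks.missing(manifest))
    sent = 0
    for block_hash, data in zip(manifest, chunks):
        if block_hash in needed:
            blocks.put(block_hash, data)
            needed.discard(block_hash)  # a chunk repeated within the file is sent once
            sent += 1
    meta.commit(path, manifest)
    return sent


blocks, meta = BlockStore(), MetadataStore()
v1 = [b"intro", b"body", b"outro"]
v2 = [b"intro", b"body, revised", b"outro"]

sent_first = upload("/alice/report.txt", v1, blocks, meta)   # every chunk is new
sent_edit = upload("/alice/report.txt", v2, blocks, meta)    # only the edited chunk
sent_copy = upload("/bob/report-copy.txt", v1, blocks, meta) # another user's identical file
print(sent_first, sent_edit, sent_copy)  # prints: 3 1 0
```

Committing the manifest only after the blocks are stored means a reader never sees a version whose blocks are missing.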
When to use
File sync, code repos (Git is the same idea), backup, mobile photo sync.
When not to
Single-user, single-device storage: the chunking and manifest overhead is not worth it.
flowchart LR
  Client([Client]) -->|manifest of hashes| API[Upload API]
  API -->|missing chunks| Client
  Client -->|chunks| Blob[(Blob Store)]
  API --> Meta[(Metadata DB · Spanner)]
  Meta --> Notif[Notification Service]
  Notif -.push.-> OtherDevice[Other Devices]
  OtherDevice -->|fetch manifest| API
Key insights
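- Content-defined chunking places chunk boundaries based on content rather than fixed offsets, so an insertion or deletion shifts only nearby boundaries and small edits invalidate few chunks.
- Block dedup across users requires deterministic chunking and cryptographic hashes, so identical content always produces identical block IDs.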
- Manifest is the unit of versioning; blocks are immutable + content-addressed.
- Sync uses long-poll or websocket for near-instant updates, with periodic polling as a fallback.
- ACL checks must happen server-side; never trust client to filter shared files.
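Several of these points (manifests as versions, server-side ACL checks, and downloading only missing blocks) can be illustrated together. This is a minimal sketch under assumptions: `MetadataService`, `FileRecord`, the `"read"`/`"write"` roles, and `blocks_to_download` are hypothetical names invented for the example, and the block hashes are placeholder strings.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    owner: str
    acl: dict[str, str] = field(default_factory=dict)         # user -> "read" | "write"
    manifests: list[list[str]] = field(default_factory=list)  # one manifest per version


class MetadataService:
    """Hypothetical metadata API; every call checks permissions on the server."""

    def __init__(self) -> None:
        self.files: dict[str, FileRecord] = {}

    def _check(self, user: str, file_id: str, need: str) -> FileRecord:
        rec = self.files.get(file_id)
        role = None
        if rec is not None:
            role = "write" if user == rec.owner else rec.acl.get(user)
        # The same error covers "no such file" and "no access", so existence is not leaked.
        if role is None or (need == "write" and role != "write"):
            raise PermissionError(f"{user} may not {need} {file_id}")
        return rec

    def create(self, owner: str, file_id: str) -> None:
        self.files[file_id] = FileRecord(owner)

    def share(self, owner: str, file_id: str, user: str, role: str) -> None:
        self._check(owner, file_id, "write").acl[user] = role

    def commit(self, user: str, file_id: str, manifest: list[str]) -> int:
        rec = self._check(user, file_id, "write")
        rec.manifests.append(list(manifest))  # old versions are kept, never mutated
        return len(rec.manifests)

    def latest(self, user: str, file_id: str) -> tuple[int, list[str]]:
        rec = self._check(user, file_id, "read")
        return len(rec.manifests), list(rec.manifests[-1]) if rec.manifests else []


def blocks_to_download(local_blocks: set[str], manifest: list[str]) -> list[str]:
    """Client side: blocks named by the new manifest that are not cached locally."""
    seen, needed = set(local_blocks), []
    for block_hash in manifest:
        if block_hash not in seen:
            seen.add(block_hash)
            needed.append(block_hash)
    return needed


svc = MetadataService()
svc.create("alice", "f1")
svc.commit("alice", "f1", ["h1", "h2", "h3"])
svc.share("alice", "f1", "bob", "read")
svc.commit("alice", "f1", ["h1", "h2b", "h3"])  # alice edits the middle chunk

version, manifest = svc.latest("bob", "f1")
needed = blocks_to_download({"h1", "h2", "h3"}, manifest)
print(version, needed)  # prints: 2 ['h2b']
```

Because each version is just a list of block hashes, keeping history costs one extra manifest plus the blocks that actually changed. A read-only user such as `bob` calling `commit`, or an unknown user calling `latest`, raises `PermissionError` in this sketch.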