
Design Google Drive / Dropbox

File chunking, deduplication, sync, version history, and sharing/permissions. A classic system design interview (SDI) question, commonly asked at the Mag7 companies.

hard · 4h · general · system-design

Theory

Explanation

Intuition first, formal definition second. Skim the bullets if you already know this; read the prose if you don't.

When 1 KB of a 1 GB file changes, sync should transfer only the affected chunk, not the whole file. Content-defined chunking plus block-level deduplication achieves that, and it also saves storage when many users upload the same file.

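The client splits each file into variable-size chunks using a rolling hash (for example, a Rabin fingerprint) and hashes every chunk with a cryptographic hash. It sends the manifest, the ordered list of chunk hashes, to the server; the server replies with the hashes it does not already store, and the client uploads only those chunks. File metadata (path, permissions, version) lives in a NewSQL database such as Spanner, while the chunks themselves go to an object store. To sync, each client keeps a local manifest, polls for (or is notified of) a newer server version, fetches the new manifest, downloads only the blocks it is missing, and reassembles the file locally. Sharing is an ACL on the file metadata, and permissions are checked on every API call.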
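
The upload path described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production chunker: it swaps in a simple Gear-style rolling hash for the Rabin fingerprint, the function names (`chunk`, `manifest`, `missing_chunks`) and size parameters are hypothetical, and the server's chunk index is modeled as a plain set.

```python
import hashlib

# Fixed table of 256 pseudo-random 64-bit values. It is derived deterministically
# so that every client cuts the same content at the same boundaries.
GEAR = [int.from_bytes(hashlib.sha256(bytes([i])).digest()[:8], "big") for i in range(256)]
MASK64 = (1 << 64) - 1


def chunk(data: bytes, avg_bits: int = 12, min_size: int = 1024, max_size: int = 16384) -> list[bytes]:
    """Split data at content-defined boundaries (Gear-style rolling hash, assumed here)."""
    cut_mask = (1 << avg_bits) - 1  # a cut fires roughly once every 2**avg_bits bytes
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # The low avg_bits bits of h depend only on the last avg_bits bytes,
        # so boundaries follow the content rather than absolute offsets.
        h = ((h << 1) + GEAR[byte]) & MASK64
        size = i - start + 1
        if (size >= min_size and (h & cut_mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def manifest(data: bytes) -> list[str]:
    """Ordered list of SHA-256 chunk hashes: the file's manifest."""
    return [hashlib.sha256(c).hexdigest() for c in chunk(data)]


def missing_chunks(manifest_hashes: list[str], server_has: set[str]) -> list[str]:
    """Server side: which hashes from the manifest are not stored yet (deduplicated)."""
    return list(dict.fromkeys(h for h in manifest_hashes if h not in server_has))
```

A client would call `manifest(file_bytes)`, send the result to the server, and upload only the chunks whose hashes come back from `missing_chunks`. Because boundaries depend on content, inserting a few bytes in the middle of a file typically changes only the chunk around the edit.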

When to use

File sync, code repositories (Git's object store uses the same content-addressing idea), backup, mobile photo sync.

When not to

Single-user, single-device storage: the chunking and manifest overhead is not worth it.

flowchart LR
  Client([Client]) -->|manifest of hashes| API[Upload API]
  API -->|missing chunks| Client
  Client -->|chunks| Blob[(Blob Store)]
  API --> Meta[(Metadata DB · Spanner)]
  Meta --> Notif[Notification Service]
  Notif -.push.-> OtherDevice[Other Devices]
  OtherDevice -->|fetch manifest| API
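
The right-hand side of the flow, where another device fetches a new manifest and brings its copy up to date, might look like the sketch below. It assumes a manifest is an ordered list of SHA-256 hex digests; `blocks_to_fetch`, `apply_manifest`, and the `fetch_block` callback (standing in for a blob-store download) are hypothetical names.

```python
import hashlib
from typing import Callable


def blocks_to_fetch(new_manifest: list[str], local_blocks: dict[str, bytes]) -> list[str]:
    """Hashes in the new manifest that the local block cache lacks (deduplicated, in order)."""
    return list(dict.fromkeys(h for h in new_manifest if h not in local_blocks))


def apply_manifest(
    new_manifest: list[str],
    local_blocks: dict[str, bytes],
    fetch_block: Callable[[str], bytes],
) -> bytes:
    """Download only the missing blocks, verify them, and reassemble the file."""
    for h in blocks_to_fetch(new_manifest, local_blocks):
        block = fetch_block(h)  # assumed to download one block from the blob store
        # Blocks are content-addressed, so the hash doubles as an integrity check.
        if hashlib.sha256(block).hexdigest() != h:
            raise ValueError(f"block {h} failed its integrity check")
        local_blocks[h] = block
    return b"".join(local_blocks[h] for h in new_manifest)
```

Blocks already in the local cache are never downloaded again, which is what keeps the sync traffic for a small edit proportional to the edit rather than to the file size.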

Key insights

  • Content-defined chunking aligns chunk boundaries to the content itself, so a small edit invalidates only a few chunks.
  • Block dedup across users requires deterministic chunking plus cryptographic hashes.
  • Manifest is the unit of versioning; blocks are immutable + content-addressed.
  • Sync uses long polling or WebSockets for near-instant updates, with periodic polling as a fallback.
  • ACL checks must happen server-side; never trust client to filter shared files.
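
Two of the points above, manifests as the unit of versioning and server-side ACL checks, can be illustrated with a small in-memory metadata store. This is a sketch under assumptions: the `MetadataStore` and `FileRecord` classes, the "read"/"write" roles, and the 1-based version numbers are all hypothetical, and a real system would back this with the metadata database rather than a dict.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FileRecord:
    owner: str
    acl: dict[str, str] = field(default_factory=dict)  # user -> "read" | "write"
    versions: list[tuple[str, ...]] = field(default_factory=list)  # one immutable manifest per version


class MetadataStore:
    def __init__(self) -> None:
        self.files: dict[str, FileRecord] = {}

    def _role(self, user: str, path: str) -> Optional[str]:
        rec = self.files.get(path)
        if rec is None:
            return None
        return "write" if user == rec.owner else rec.acl.get(user)

    def share(self, owner: str, path: str, user: str, role: str) -> None:
        rec = self.files.get(path)
        if rec is None or rec.owner != owner:
            raise PermissionError("only the owner can share")
        rec.acl[user] = role

    def commit(self, user: str, path: str, manifest: list[str]) -> int:
        """Append a new version; earlier manifests are never modified."""
        rec = self.files.get(path)
        if rec is None:
            rec = self.files[path] = FileRecord(owner=user)
        elif self._role(user, path) != "write":
            raise PermissionError("write access required")
        rec.versions.append(tuple(manifest))
        return len(rec.versions)  # 1-based version number

    def get_manifest(self, user: str, path: str, version: Optional[int] = None) -> tuple[str, ...]:
        # Checked on every call; a missing file and a forbidden file look the same.
        if self._role(user, path) is None:
            raise PermissionError("no access")
        versions = self.files[path].versions
        if version is None:
            return versions[-1]
        if not 1 <= version <= len(versions):
            raise KeyError(f"no version {version}")
        return versions[version - 1]

    def list_visible(self, user: str) -> list[str]:
        """The server filters the listing; the client never sees paths it cannot read."""
        return sorted(p for p in self.files if self._role(user, p) is not None)
```

Restoring an old version is then just committing an earlier manifest again: the blocks it references are immutable and still in the object store, so no data needs to be copied.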