Design Real-Time Autonomous-Vehicle Data Ingestion Pipeline
Vehicle → edge → cloud, sensor data compression, Kafka/Pulsar, hot/cold tiering, retraining loop.
Theory
Explanation
Intuition first, formal definition second. Skim the bullets if you already know this; read the prose if you don't.
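A single AV generates terabytes of sensor data per day, far more than a vehicle can upload. The pipeline therefore works in five stages: edge filtering selects rare or important events; selective upload streams them to the cloud; a data lake stores petabytes of fleet data; auto-labeling turns the stored events into training data; and a retraining loop closes the cycle by shipping improved models back to the vehicles.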
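To see why uploading everything is not an option, a back-of-envelope calculation helps. The numbers below (4 TB of raw data per vehicle per day, a sustained 20 Mbit/s cellular uplink) are illustrative assumptions, not measurements; the 1000x reduction factor is the one quoted under Key insights.

```python
# Back-of-envelope upload budget. All inputs are illustrative assumptions.

def upload_hours(data_bytes: float, uplink_bits_per_s: float) -> float:
    """Hours needed to push data_bytes over a sustained uplink."""
    return data_bytes * 8 / uplink_bits_per_s / 3600


TB = 1e12
raw_per_day = 4 * TB   # assumed raw sensor volume per vehicle per day
uplink = 20e6          # assumed sustained cellular uplink: 20 Mbit/s

raw_hours = upload_hours(raw_per_day, uplink)
filtered_hours = upload_hours(raw_per_day / 1000, uplink)  # after 1000x edge reduction

print(f"raw: {raw_hours:.0f} h, filtered: {filtered_hours:.2f} h")
# prints "raw: 444 h, filtered: 0.44 h"
```

Under these assumptions one day of raw data would need more than 18 days of continuous upload, while the filtered stream fits in under half an hour.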
Edge: per-vehicle compute filters the sensor stream (camera, LiDAR, radar) and keeps events where the model disagrees with itself, hard scenarios, and rare classes. Kept events are compressed and uploaded over cellular when bandwidth is available.
Cloud ingest: a Kafka or Pulsar stream receives the uploads and writes payloads to an object lake (S3-tier), with a metadata DB indexing them.
Auto-labeling: heavy models (8x larger than the on-vehicle model) run offline to produce ground-truth labels.
Retraining loop: filter the labeled data, train the next model, deploy it via OTA, and repeat.
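The edge stage can be sketched as a scoring function plus a priority queue that waits for bandwidth. This is a minimal illustration, not a production design: `RARE_CLASSES`, `KEEP_THRESHOLD`, the fixed scores, and the `UploadQueue` class are all hypothetical, and `zlib` stands in for the sensor-specific codecs a real vehicle would use.

```python
import heapq
import zlib
from dataclasses import dataclass, field
from typing import List, Optional, Set

RARE_CLASSES = {"animal", "debris"}  # hypothetical rare-class list
KEEP_THRESHOLD = 0.5                 # hypothetical cut-off for keeping a frame


def interest_score(labels: Set[str], disagreement: float, hard: bool) -> Optional[float]:
    """Return an interest score, or None if the frame should be dropped."""
    score = disagreement             # model self-disagreement, assumed in [0, 1]
    if labels & RARE_CLASSES:
        score = max(score, 0.8)      # rare classes are always kept
    if hard:
        score = max(score, 0.6)      # frame flagged as a hard scenario
    return score if score >= KEEP_THRESHOLD else None


@dataclass(order=True)
class Event:
    priority: float                  # negated score, so the heap pops the best first
    payload: bytes = field(compare=False)


class UploadQueue:
    """Holds compressed events until cellular bandwidth is available."""

    def __init__(self) -> None:
        self._heap: List[Event] = []

    def offer(self, raw: bytes, score: float) -> None:
        heapq.heappush(self._heap, Event(-score, zlib.compress(raw)))

    def drain(self, budget_bytes: int) -> List[bytes]:
        """Pop events in score order until the next one exceeds the budget."""
        sent: List[bytes] = []
        while self._heap and len(self._heap[0].payload) <= budget_bytes:
            event = heapq.heappop(self._heap)
            budget_bytes -= len(event.payload)
            sent.append(event.payload)
        return sent
```

A caller would score each frame, `offer` the ones that return a score, and call `drain` with whatever byte budget the modem currently reports. Draining by score means the most informative events leave the vehicle first when connectivity is short-lived.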
When to use
AV, drone fleets, robotics fleets.
When not to
Single-vehicle hobby projects (no fleet scale).
flowchart LR
    Sensor[Camera + LiDAR + Radar] --> Edge[On-Vehicle Filter]
    Edge -->|interesting events| Upload[Selective Upload]
    Upload --> Ingest[Cloud Ingest · Kafka]
    Ingest --> Lake[(Data Lake · S3 / GCS)]
    Ingest --> Meta[(Metadata DB)]
    Lake --> Label[Auto-Labeling]
    Label --> Train[Training]
    Train --> Model[(Model Registry)]
    Model -->|OTA| Edge
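The split between the data lake and the metadata DB in the flowchart can be illustrated with a small index that stores only descriptors and a pointer to the payload. This sketch uses an in-memory SQLite database as a stand-in for whatever metadata store a real fleet would run; the table layout, column names, sample rows, and `s3://example-bucket/...` URIs are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        event_id   TEXT PRIMARY KEY,
        ts         INTEGER,  -- Unix seconds
        lat        REAL,
        lon        REAL,
        weather    TEXT,
        event_type TEXT,
        object_uri TEXT      -- where the payload lives in the lake
    )
    """
)
# Composite index so rare-scenario lookups do not scan the whole table.
conn.execute("CREATE INDEX idx_scenario ON events (weather, event_type, ts)")

rows = [
    ("e1", 1700000000, 37.77, -122.42, "rain", "cut_in", "s3://example-bucket/e1.bin"),
    ("e2", 1700000500, 37.78, -122.41, "clear", "cut_in", "s3://example-bucket/e2.bin"),
    ("e3", 1700001000, 37.79, -122.40, "rain", "cut_in", "s3://example-bucket/e3.bin"),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?, ?, ?)", rows)


def find_events(weather: str, event_type: str, since_ts: int) -> list:
    """Return payload URIs for matching events, oldest first."""
    cur = conn.execute(
        "SELECT object_uri FROM events "
        "WHERE weather = ? AND event_type = ? AND ts >= ? ORDER BY ts",
        (weather, event_type, since_ts),
    )
    return [row[0] for row in cur]


print(find_events("rain", "cut_in", 0))
```

The query touches only the small metadata rows; the large sensor payloads stay in object storage until a labeling or training job fetches them by URI.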
Key insights
- Edge filtering is where the roughly 1000x data reduction happens: upload only what teaches the model something new.
- Disagreement-triggered upload selects events where the on-vehicle model differs from a larger shadow model on the same frame.
- Auto-labeling runs heavier models offline, which is cheaper than human annotation at scale.
- Retraining cadence must be balanced against deployment risk; new models ship through a staged OTA rollout.
- Metadata indexes (timestamp, location, weather, events) make rare scenarios queryable.
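One way to make the disagreement trigger concrete is to compare the class probabilities the two models assign to the same frame. The sketch below assumes each model emits a probability vector over the same classes and uses total variation distance with a hypothetical threshold of 0.3; a real system would more likely compare full detection outputs (boxes, tracks) rather than a single vector.

```python
from typing import Sequence


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def argmax(values: Sequence[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


def should_upload(onboard: Sequence[float], shadow: Sequence[float],
                  threshold: float = 0.3) -> bool:
    """Flag a frame when the models pick different classes or diverge strongly.

    `threshold` is a hypothetical tuning knob, not a recommended value.
    """
    if argmax(onboard) != argmax(shadow):
        return True
    return total_variation(onboard, shadow) > threshold


# Models agree closely: frame is not uploaded.
print(should_upload([0.90, 0.05, 0.05], [0.85, 0.10, 0.05]))  # False
# Models pick different top classes: frame is uploaded.
print(should_upload([0.6, 0.3, 0.1], [0.3, 0.6, 0.1]))        # True
```

Raising the threshold uploads fewer frames and saves bandwidth; lowering it captures subtler disagreements at the cost of more data.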