The Scale Reality Check
Let’s establish the baseline before we discuss the architecture. A production-grade maps platform handles:
- 50 million GPS location updates per second (active navigators, background apps, fleet vehicles, OEM telemetry)
- 5 billion route requests per day (~58,000 per second on average, with peaks well above that)
- 20 billion tile requests per day with <50ms latency at the CDN edge
- 500 million POI records queried with <100ms latency
The ingestion bandwidth alone is roughly 5 GB/second of raw GPS data (50 million updates at ~100 bytes each). If you’re still thinking about handling this with micro-batching strategies, you’ve already lost. By the time your Spark job finishes aggregating, that traffic jam has cleared and reformed three times.

Kafka Partitioning: Geography as Load Balancer
The first mistake teams make is partitioning by user ID or timestamp. In geo-streaming, you partition by geography: compute the S2 cell at level 12 for each ping and hash the cell ID onto the topic’s ~4,000 partitions. Why? Because all GPS pings from the same geographic area must land on the same partition to enable downstream aggregation without shuffling terabytes of data across your Flink cluster.
Each GPS ping carries: {user_id_hash, lat, lng, speed_kmh, heading_degrees, timestamp, source_type}. At 50M updates/sec, a standard RDBMS dies instantly. Even Kafka requires careful tuning: this isn’t your typical “log of user actions” but a spatiotemporal firehose where ordering matters within geographic boundaries but not globally.
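A minimal sketch of the partitioning idea. A real implementation would derive the key from an S2 level-12 cell ID (via an S2 library such as s2sphere); here a fixed lat/lng grid stands in, and the grid resolution and hashing scheme are illustrative assumptions:

```python
import hashlib

NUM_PARTITIONS = 4000  # partition count from the text

def grid_cell(lat: float, lng: float, cells_per_degree: int = 16) -> int:
    """Simplified fixed-grid cell ID. A production system would use an
    S2 level-12 cell ID here instead of a lat/lng grid."""
    row = int((lat + 90.0) * cells_per_degree)
    col = int((lng + 180.0) * cells_per_degree)
    return row * (360 * cells_per_degree) + col

def partition_for(ping: dict) -> int:
    """Map a GPS ping to a partition by its geographic cell, so all pings
    from the same area land on the same partition."""
    cell = grid_cell(ping["lat"], ping["lng"])
    digest = hashlib.sha256(str(cell).encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

ping_a = {"user_id_hash": "a1", "lat": 52.5200, "lng": 13.4050,
          "speed_kmh": 42.0, "heading_degrees": 90,
          "timestamp": 1710087600, "source_type": "fleet"}
# A second device ~15 meters away:
ping_b = dict(ping_a, user_id_hash="b2", lat=52.5201, lng=13.4051)
p_a, p_b = partition_for(ping_a), partition_for(ping_b)
```

Two users in the same city block hash to the same partition; two users on different continents almost never do.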
This is where event-driven architecture meets its match. The AWS DevOps Agent approach of correlating events across microservices works for container orchestration, but geospatial streaming requires you to correlate events across space itself.
Flink: Not Your Father’s Stream Processor
Apache Flink here isn’t just “processing streams”; it’s performing Hidden Markov Model map matching at scale. Raw GPS coordinates (lat/lng) must be snapped to road segments with 5-10 meter accuracy. A naive implementation costs ~0.5ms per point, which at 50M points/second requires 25,000 CPU cores just for map matching.
The optimization? Geometric fast-pathing. For straight road segments without ambiguity (no overpasses, no parallel highways), use lightweight geometric snapping (nearest road within 10m). Reserve the expensive HMM/Viterbi algorithm for complex interchanges and urban canyons. This reduces HMM calls by ~80%, cutting CPU requirements to ~5,000 cores: still massive, but manageable.
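The fast-path dispatch can be sketched as follows. The candidate list, snap radius, and ambiguity test are simplified assumptions; a production matcher would pull candidates from a spatial index and hand the ambiguous cases to Viterbi:

```python
import math

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters."""
    r = 6_371_000
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lng2 - lng1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def match_point(lat, lng, candidates, snap_radius_m=10.0):
    """candidates: (segment_id, nearest_lat, nearest_lng) tuples for nearby
    road segments, assumed to come from a spatial index.
    Returns (segment_id or None, method)."""
    within = [(haversine_m(lat, lng, slat, slng), seg)
              for seg, slat, slng in candidates
              if haversine_m(lat, lng, slat, slng) <= snap_radius_m]
    if len(within) == 1:
        # Fast path: exactly one unambiguous road within the snap radius.
        return within[0][1], "geometric"
    # Zero or multiple nearby roads (parallel highways, interchanges):
    # fall through to the expensive HMM/Viterbi matcher (not shown).
    return None, "hmm"
```

The key design point is that the cheap test runs on every point, and only points that fail it pay the Viterbi cost.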
Flink processes these map-matched points with 30-second tumbling windows per road segment:
1. Collect speed observations for segment ID B-C in the window
2. Compute median speed (robust to GPS noise and stopped vehicles at traffic lights)
3. Write to Redis: segment:B-C → {speed_kmh: 10, samples: 30, confidence: 0.92}
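The three steps above in miniature, with the confidence formula as an invented placeholder:

```python
from statistics import median

def aggregate_window(observations):
    """observations: (segment_id, speed_kmh) pairs collected in one
    30-second tumbling window. Returns per-segment records shaped like
    the Redis values described in the text. The confidence heuristic
    (samples / 30, capped at 1.0) is an illustrative assumption."""
    by_segment = {}
    for seg, speed in observations:
        by_segment.setdefault(seg, []).append(speed)
    return {
        f"segment:{seg}": {
            "speed_kmh": median(speeds),  # robust to outliers
            "samples": len(speeds),
            "confidence": min(1.0, len(speeds) / 30),
        }
        for seg, speeds in by_segment.items()
    }

window = [("B-C", 12), ("B-C", 9), ("B-C", 10), ("B-C", 95),  # 95 = GPS glitch
          ("C-D", 60)]
result = aggregate_window(window)
```

Note how the median shrugs off the 95 km/h GPS glitch that a mean would have averaged in.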
The exactly-once semantics here aren’t just nice-to-have; they’re critical. Double-counting a speed sample could mark a highway as congested when it’s actually flowing freely, triggering unnecessary reroutes for thousands of drivers.
Redis: The Hot Path Bottleneck
Here’s where generic stream processing advice falls apart. Most tutorials tell you to dump aggregated data into a data lake or analytics warehouse. For geo-routing, latency is existential. The routing engine must read real-time segment speeds in sub-millisecond time to compute traffic-aware ETAs.
Key:   segment:{segment_id}
Value: { "speed_kmh": 45, "free_flow_kmh": 65, "samples": 23,
         "confidence": 0.92, "updated_at": 1710087600 }
Every 30 seconds, routing servers poll Redis for changed segments and recalculate affected shortcuts using Customizable Contraction Hierarchies (CCH). This isn’t a cache; it’s the operational state that determines whether your algorithm routes drivers through a gridlocked intersection.
When Redis fails (and it will), the system degrades gracefully to historical traffic patterns, trading the real-time 10% ETA-error target for roughly 15%. But during that 10-15 second failover window, your ETA predictions are essentially guessing based on last Tuesday’s traffic.
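The degradation logic might look roughly like this, with a plain dict standing in for Redis and the freshness threshold borrowed from the 120-second cached-speed TTL in the failure-scenarios section:

```python
import time

FRESHNESS_TTL_S = 120  # routing engine's cached-speed TTL from the text

def segment_speed(segment_id, live_store, historical, now=None):
    """live_store: dict standing in for Redis; historical: per-segment
    historical speed profile (e.g., by day-of-week and hour).
    Falls back to historical patterns when live data is missing or stale."""
    now = time.time() if now is None else now
    entry = live_store.get(f"segment:{segment_id}")
    if entry is not None and now - entry["updated_at"] <= FRESHNESS_TTL_S:
        return entry["speed_kmh"], "live"
    return historical[segment_id], "historical"

live = {"segment:B-C": {"speed_kmh": 10, "updated_at": 1_710_087_600}}
hist = {"B-C": 48, "C-D": 55}
```

The same code path covers both failure modes: a segment Redis never saw and a segment whose last update is older than the TTL.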
The Routing Miracle: Preprocessing 500 Million Nodes
The stream processing pipeline feeds into a routing engine that solves a seemingly impossible problem: finding the shortest path through a graph with 500 million nodes and 1.2 billion edges in under a millisecond.
You can’t run Dijkstra’s algorithm here; it would take ~2 seconds per query. A* search improves this to ~400ms, still 400x too slow for a sub-millisecond budget. The solution is Contraction Hierarchies (CH): preprocess the graph offline by removing low-importance nodes and adding “shortcut” edges that preserve shortest-path distances.
| Algorithm | Preprocessing | Query Time | Nodes Explored |
|---|---|---|---|
| Dijkstra | None | ~2 seconds | ~2,000,000 |
| A* | None | ~400ms | ~500,000 |
| Contraction Hierarchies | 4-6 hours (global) | <1ms | ~500 |
The CCH variant separates graph structure (preprocessed weekly) from edge weights (updated every 30 seconds from Redis). When traffic changes, only affected shortcut weights are recalculated incrementally in <500ms.
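To make the CH idea concrete, here is a toy preprocess-and-query sketch. It contracts nodes naively (no witness search, no CCH separation of structure from weights), so it adds more shortcuts than necessary, but the distances it returns are exact:

```python
import heapq
from collections import defaultdict

def preprocess(edges, order):
    """Naive contraction hierarchy on an undirected graph.
    edges: {(u, v): weight}; order: contraction order (rank = position).
    For each contracted node v, adds a shortcut between every pair of its
    higher-ranked neighbors with weight d(u,v) + d(v,w)."""
    rank = {node: i for i, node in enumerate(order)}
    g = defaultdict(dict)
    for (u, v), w in edges.items():
        g[u][v] = min(g[u].get(v, float("inf")), w)
        g[v][u] = min(g[v].get(u, float("inf")), w)
    for v in order:
        higher = [u for u in g[v] if rank[u] > rank[v]]
        for i, u in enumerate(higher):
            for w in higher[i + 1:]:
                via = g[v][u] + g[v][w]
                if via < g[u].get(w, float("inf")):
                    g[u][w] = g[w][u] = via  # shortcut preserving distance
    return g, rank

def ch_query(g, rank, s, t):
    """Bidirectional Dijkstra relaxing only edges toward higher-ranked
    nodes; the forward and backward searches meet near the hierarchy top."""
    def upward(src):
        dist, pq = {src: 0.0}, [(0.0, src)]
        while pq:
            d, u = heapq.heappop(pq)
            if d > dist[u]:
                continue
            for v, w in g[u].items():
                if rank[v] > rank[u] and d + w < dist.get(v, float("inf")):
                    dist[v] = d + w
                    heapq.heappush(pq, (d + w, v))
        return dist
    df, db = upward(s), upward(t)
    return min((df[v] + db[v] for v in df if v in db), default=float("inf"))
```

On the real graph, the win is that each upward search touches ~500 nodes instead of millions; CCH additionally lets the 30-second traffic updates rewrite shortcut weights without redoing the contraction.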
The road graph itself is stored as a custom binary adjacency array, memory-mapped on routing servers. No deserialization on startup; the OS pages data from disk as it’s accessed. A regional shard (~5-12 GB) loads in ~60 seconds via lazy memory-mapping.
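The memory-mapping idea in miniature. The binary layout here (per-node edge count followed by uint32 target / float32 weight pairs, with offsets computed at build time) is an invented toy format; the real format is custom and unspecified:

```python
import mmap
import os
import struct
import tempfile

def write_graph(path, adjacency):
    """adjacency: {node_index: [(target, weight), ...]} with dense 0..n-1 ids.
    Returns the byte offset of each node's record."""
    offsets = []
    with open(path, "wb") as f:
        for node in range(len(adjacency)):
            offsets.append(f.tell())
            edges = adjacency[node]
            f.write(struct.pack("<I", len(edges)))
            for tgt, w in edges:
                f.write(struct.pack("<If", tgt, w))
    return offsets

def neighbors(mm, offsets, node):
    """Read one node's edge list straight out of the memory map; the OS
    pages in only the bytes actually touched."""
    off = offsets[node]
    (count,) = struct.unpack_from("<I", mm, off)
    return [struct.unpack_from("<If", mm, off + 4 + 8 * i) for i in range(count)]

adj = {0: [(1, 1.5), (2, 3.0)], 1: [(2, 1.0)], 2: []}
path = os.path.join(tempfile.mkdtemp(), "graph.bin")
offsets = write_graph(path, adj)
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    n0 = neighbors(mm, offsets, 0)
    n2 = neighbors(mm, offsets, 2)
    mm.close()
```

A real shard would also store the offset table in the file itself; the point is that “loading” the graph is just `mmap`, not a parse.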
Geospatial Indexing: S2 Cells vs. The World
Why S2 Cells instead of Geohash or H3? Because boundary discontinuities are fatal at scale. Two points 1 meter apart can have completely different Geohash prefixes if they straddle a cell boundary, forcing a query to expand into as many as 8 neighboring cells’ range scans to avoid missing results.
S2 projects Earth onto six cube faces using a Hilbert space-filling curve, giving each cell a 64-bit integer ID. Spatially nearby points get numerically nearby IDs. A circular query (“coffee shops within 2km”) translates to 4-8 contiguous integer ranges: fast, exact, and free of boundary artifacts.
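To show the locality property without an S2 dependency, here is a Z-order (Morton) cell ID sketch. S2’s Hilbert curve has strictly better locality than Z-order, so treat this as a simplified stand-in for the principle, not the algorithm:

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Morton/Z-order interleave: bit-weave two grid coordinates into one
    integer so that nearby cells get numerically nearby IDs."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return z

def cell_id(lat: float, lng: float, bits: int = 16) -> int:
    """Quantize lat/lng onto a 2^bits x 2^bits grid and linearize it."""
    x = int((lng + 180.0) / 360.0 * ((1 << bits) - 1))
    y = int((lat + 90.0) / 180.0 * ((1 << bits) - 1))
    return interleave_bits(x, y, bits)

# Two points a block apart in Berlin map to close IDs; Sydney does not,
# which is what turns "within 2km" into a few integer range scans.
berlin_a = cell_id(52.5200, 13.4050)
berlin_b = cell_id(52.5210, 13.4060)
sydney = cell_id(-33.8688, 151.2093)
```

In a real system the covering library emits the handful of [lo, hi] ID ranges for a disc, and the store answers each with a plain integer range scan.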
H3 (hexagonal indexing) is used only for analytics and heat maps, where uniform neighbor distance matters more than database range scan performance.
Vector Tiles: The Rendering Pipeline
The traffic data doesn’t just feed routing; it renders as colored road segments on the map. This uses Mapbox Vector Tiles (MVT), binary Protocol Buffers encoding geometric primitives (points, lines, polygons) in a 4096×4096 integer grid.
Vector tiles are 10-50x smaller than raster tiles (~15 KB vs ~150 KB) and enable dynamic styling without re-generation. The traffic tile builder reads from Redis, encodes speed-colored segments, and writes to S3 with a 30-second CDN TTL.
The coordinate math is elegant: tile (z,x,y) identifies the geographic bounding box via Web Mercator projection, while extent coordinates (0-4096) position features within that tile using cheap integers instead of float64 lat/lng pairs.
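That coordinate math can be sketched with the standard slippy-map formulas plus the tile-local extent conversion (the exact tile asserted in the usage is the standard Web Mercator result for central Berlin at zoom 14):

```python
import math

EXTENT = 4096  # MVT tile-local integer grid

def latlng_to_tile(lat, lng, z):
    """Standard Web Mercator / slippy-map tile addressing: fractional
    tile coordinates at zoom z."""
    n = 1 << z
    fx = (lng + 180.0) / 360.0 * n
    fy = (1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n
    return fx, fy

def latlng_to_tile_coords(lat, lng, z):
    """Return ((z, x, y), (ex, ey)): the tile address plus the feature's
    integer position on the tile's 0..4096 extent grid."""
    fx, fy = latlng_to_tile(lat, lng, z)
    tx, ty = int(fx), int(fy)
    ex = int((fx - tx) * EXTENT)
    ey = int((fy - ty) * EXTENT)
    return (z, tx, ty), (ex, ey)

tile, (ex, ey) = latlng_to_tile_coords(52.5200, 13.4050, 14)
```

Once inside a tile, every vertex is two small integers rather than two float64s, which is where much of the 10-50x size win comes from.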
Failure Scenarios: When the Pipeline Bursts
Kafka Broker Failure: Partitions on the failed broker go stale for 15-30 seconds. The routing engine uses cached speeds (120s TTL), then falls back to historical patterns. ETA accuracy degrades, but routes still compute.
Flink Checkpoint Failure: The job restarts from the last checkpoint. Traffic data freshness gaps equal the checkpoint interval (30-60 seconds), but idempotent writes to Redis prevent double-counting.
Re-Routing Stampede: A highway incident triggers 50,000+ simultaneous re-route requests. Mitigation involves pre-computing detour routes for affected corridors and serving cached alternatives with random jitter (0-5s) to spread the WebSocket burst.
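The jitter half of that mitigation in miniature; the client count, jitter bound, and seeding are illustrative:

```python
import random

def schedule_reroute_pushes(client_ids, max_jitter_s=5.0, seed=None):
    """Assign each affected client a random send offset in [0, max_jitter_s)
    so the re-route notification burst is spread over the window instead of
    landing in one instant. The detour itself is the same cached alternative
    for everyone; only the push timing varies."""
    rng = random.Random(seed)
    return {cid: rng.uniform(0.0, max_jitter_s) for cid in client_ids}

offsets = schedule_reroute_pushes(range(50_000), seed=42)
# 50k clients spread over 5s averages ~1,000 sends per 100ms bucket,
# instead of 50k simultaneous WebSocket writes.
```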
Full Region Failure: DNS failover redirects to adjacent regions within 30-60 seconds. Routing servers in the new region lazy-load graph shards from S3 (~60 seconds to full recovery).
Building This Without Google-Scale Infrastructure
Not everyone has 5,000 Flink cores lying around. For startup scale (1-10M DAU):
- Skip Kafka: Write directly to Redis at <1M updates/sec using pipelining
- Use OSRM: Open-source CH implementation handling 10K routes/sec on a single 32GB server
- Batch aggregation: Cron job every 60 seconds instead of Flink streaming (90s latency vs 30s, acceptable for most users)
- PostgreSQL + PostGIS: Handles POI search, tile metadata, and geocoding until you hit 10M+ records per table
A single 64GB server running OSRM + PostgreSQL + Redis + Nginx can serve 50K DAU with tile caching and basic traffic.
The Convergence of State and Stream
The architecture described here (Kafka for ingestion, Flink for aggregation, Redis for hot state, CCH for routing) represents a specific evolution in stream processing. It’s not just about moving data; it’s about maintaining queryable geospatial state at planetary scale with millisecond latency.
Whether Fluss or other alternatives will make Kafka obsolete for lakehouse architectures remains to be seen, but for real-time geo-intelligence, Kafka’s partitioning model and replay capabilities remain unmatched.
The critical insight? In geo-streaming, your architecture isn’t defined by your event choreography, but by your state relationships. The structure of the road graph, the S2 cell coverings, and the Redis hot paths matter more than the elegance of your Flink topology. Events modify state; they don’t define it.
When 50 million devices are screaming their coordinates into your pipeline every second, you learn quickly that handling events is not the same as architecting for them. The difference between a working system and a traffic catastrophe is whether your state model can answer “what is true right now” faster than a driver can miss their exit.




