Your Events Will Be Duplicated: Idempotency at Scale

Every event-driven system built on at-least-once delivery shares a dirty secret: duplicates aren’t the exception, they’re the contract. You can build the most elegant stream processing pipeline in the world, but the moment a worker crashes mid-processing, a network packet vanishes, or a consumer group rebalances, that same event will arrive again. And again.

The question isn’t if you’ll see duplicates. It’s whether your system will survive them.

A developer recently laid out their architecture on Reddit: a push notification platform using Redis Streams and Python, consuming domain events like UserRegistered, PortfolioCreated, and GoalCompleted. They’d already implemented an idempotency key approach: generate a key per event, store processing state in Redis, reject duplicates, and shunt persistently failing events to a dead-letter queue (DLQ). Solid foundation. But the comments revealed just how many ways this can go sideways.

Let’s dig into what actually works when you’re processing millions of events and can’t afford to process any of them twice.

The Four Horsemen of Duplicate Events

Before you can defend against duplicates, you need to understand where they come from. In practice, there are four scenarios that will reliably produce them:

Scenario	How It Happens	Frequency
Worker crash after processing, before ACK	Event is processed, state is saved, but the acknowledgment never reaches the broker	High
Network failure during processing	A timeout causes the sender to retry while the handler is still working	Very High
Consumer rebalancing	Partitions are reassigned, and in-flight messages get reprocessed by new consumers	Medium
Retry queue redelivery	A previous failure lands the event in a retry queue that re-delivers it hours later	Medium

Figure: Duplicate events traversing the idempotency gate in a Kafka stream.

The developer’s original post nails the core tension: “At-least-once delivery guarantees reliability, but it also guarantees duplicates.” This isn’t a bug in your message broker. It’s a feature, and one you need to design for explicitly.
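To make the first row of the table concrete, here is a minimal in-memory sketch. The `ToyBroker` class and its methods are hypothetical stand-ins, not a real broker API. It shows how a crash between processing and ACK turns one published event into two deliveries:

```python
from collections import deque


class ToyBroker:
    """Hypothetical at-least-once broker: unacked messages are redelivered."""

    def __init__(self):
        self.queue = deque()
        self.pending = set()  # delivered but not yet acknowledged

    def publish(self, event_id):
        self.queue.append(event_id)

    def deliver(self):
        event_id = self.queue.popleft()
        self.pending.add(event_id)
        return event_id

    def ack(self, event_id):
        self.pending.discard(event_id)

    def redeliver_pending(self):
        # Roughly what pending-message recovery or a rebalance does.
        for event_id in sorted(self.pending):
            self.queue.append(event_id)
        self.pending.clear()


broker = ToyBroker()
broker.publish("evt-1")
side_effects = []

# First attempt: the handler does its work, then "crashes" before the ACK.
first = broker.deliver()
side_effects.append(f"notified:{first}")
# (no broker.ack(first) here: the worker died)

# Recovery puts the unacknowledged event back on the queue.
broker.redeliver_pending()
second = broker.deliver()
side_effects.append(f"notified:{second}")
broker.ack(second)

print(side_effects)  # ['notified:evt-1', 'notified:evt-1']
```

The broker did exactly what it promised. The duplicate side effect is the handler’s problem.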

The Idempotency Key: Your First Line of Defense

The most common pattern is straightforward: generate a unique identifier per event, and before processing, check whether that identifier has been seen before. If it has, skip the event. If it hasn’t, claim it and process.

The challenge is making that check-and-claim atomic. A naive implementation (check first, then insert) inevitably races.

# DON'T DO THIS: check-then-act is a race condition
existing = await redis.get(f"processed:{event.id}")
if existing:
    return  # duplicate, skip

# Two concurrent workers can BOTH reach this line
await process_event(event)
await redis.set(f"processed:{event.id}", "1")

The developer on Reddit described the same problem: “I implement idempotency at the application level” because their message broker (Redis Streams) doesn’t provide end-to-end exactly-once processing. The solution is to make the claim atomic.
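If the race feels theoretical, it can be reproduced without Redis at all. The sketch below uses a hypothetical in-memory `FakeStore` (not a real client) whose methods yield to the event loop the way a network round trip would, then runs two workers against the same event with each strategy:

```python
import asyncio


class FakeStore:
    """Hypothetical in-memory stand-in for a key-value store."""

    def __init__(self):
        self.data = {}

    async def get(self, key):
        await asyncio.sleep(0)  # yield, like a network round trip
        return self.data.get(key)

    async def set(self, key, value):
        await asyncio.sleep(0)
        self.data[key] = value

    async def set_if_absent(self, key, value):
        await asyncio.sleep(0)
        # No await between the check and the write, so no task can interleave.
        if key in self.data:
            return False
        self.data[key] = value
        return True


async def check_then_act(store, event_id, processed):
    if await store.get(event_id):
        return
    processed.append(event_id)  # both workers can reach this line
    await store.set(event_id, "1")


async def atomic_claim(store, event_id, processed):
    if not await store.set_if_absent(event_id, "1"):
        return
    processed.append(event_id)


async def run(handler):
    store, processed = FakeStore(), []
    await asyncio.gather(*(handler(store, "evt-1", processed) for _ in range(2)))
    return processed


racy = asyncio.run(run(check_then_act))
safe = asyncio.run(run(atomic_claim))
print(len(racy), len(safe))  # prints "2 1"
```

Two workers, one event: the check-then-act version processes it twice, the atomic claim processes it once.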

With Redis: SET key value NX EX ttl

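import redis.asyncio as aioredis  # the standalone aioredis package was merged into redis-py

# decode_responses=True makes GET return str rather than bytes
redis = aioredis.from_url("redis://localhost", decode_responses=True)

async def handle_event(event):
    key = f"idempotency:{event.id}"

    # Atomic claim: SET if Not eXists, with TTL
    claimed = await redis.set(key, "processing", nx=True, ex=604800)  # 7-day TTL

    if not claimed:
        # Already seen or in progress
        return

    try:
        await process_event(event)
        # Mark as done without resetting the TTL (KEEPTTL needs Redis 6.0+)
        await redis.set(key, "done", keepttl=True)
    except Exception:
        # Release the claim so a retry can process the event
        await redis.delete(key)
        raise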

With PostgreSQL: INSERT ... ON CONFLICT DO NOTHING

CREATE TABLE idempotency_keys (
    event_id TEXT PRIMARY KEY,
    status TEXT NOT NULL DEFAULT 'processing',
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- In your application code:
INSERT INTO idempotency_keys (event_id, status)
VALUES ($1, 'processing')
ON CONFLICT (event_id) DO NOTHING
RETURNING event_id;
-- If no row is returned, the key already existed: it's a duplicate
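From Python, the same claim could look roughly like this. To keep the sketch self-contained it swaps in an in-memory SQLite database for PostgreSQL (SQLite 3.24+ accepts the same ON CONFLICT clause), and the `claim` helper is a hypothetical name. With a PostgreSQL driver the placeholders and connection setup would differ, but the shape stays the same:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE idempotency_keys (
        event_id   TEXT PRIMARY KEY,
        status     TEXT NOT NULL DEFAULT 'processing',
        created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
    """
)


def claim(conn, event_id):
    """Return True if this call inserted the key, False if it already existed."""
    with conn:  # commit on success, roll back on exception
        cur = conn.execute(
            "INSERT INTO idempotency_keys (event_id, status) "
            "VALUES (?, 'processing') "
            "ON CONFLICT (event_id) DO NOTHING",
            (event_id,),
        )
        # rowcount is 1 for a fresh insert and 0 when the conflict clause fired.
        return cur.rowcount == 1


print(claim(conn, "evt-1"))  # True
print(claim(conn, "evt-1"))  # False
```

The primary key does the work here: the database decides who wins, not your application code.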

The key insight, as one commenter on the Reddit thread pointed out: “You handle it just like you would handle it with REST API. I don’t see any difference.” They’re right. The atomicity requirement is identical.

The Storage Trade-Off: Redis vs. PostgreSQL

Both approaches work. Both have trade-offs that matter at scale.

Concern	Redis	PostgreSQL
Durability	Weak by default; evictions and failovers can lose state	Durable; survives restarts
TTL Management	Built-in with EX/PX	Must be pruned manually
Atomicity with side effects	Not available across stores; claim and side effect are separate operations	Can be in same transaction
Performance	Sub-millisecond	~millisecond with proper indexing
Operational complexity	Another data store to manage	Reuse existing database

The developer using Redis Streams made a deliberate choice. They needed “at least once delivery, consumer groups, message replay, low operational cost, pending messages recovery.” Redis met those requirements. But as another commenter pushed back: “How about high availability?”

This is the rub. Redis’s durability story for idempotency keys is weak. Replication is asynchronous, so if your Redis cluster loses a node and fails over, recently written dedup state may never have reached the new primary. The duplicate you thought you’d blocked shows up again. For notification platforms, that’s annoying but survivable. For payment systems, that’s a double charge.

The strongest approach combines both: Redis for the hot path (low-latency dedup for the common case) and PostgreSQL as the source of truth. As Google Cloud’s idempotency guide notes, “Memorystore (Redis) is perfect for short-term caching of keys… while Cloud Spanner provides the ‘five nines’ of availability and strong consistency needed for high-stakes financial transactions.”
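One way to picture the combined approach is a two-tier claim. The sketch below is a deliberately simplified illustration: plain Python sets stand in for Redis and for a PostgreSQL table, and the `TwoTierDedup` class is a hypothetical name. In a real system the durable step would be the atomic INSERT ... ON CONFLICT claim, not a set lookup.

```python
class TwoTierDedup:
    """Hypothetical two-tier claim: a fast cache in front of a durable store."""

    def __init__(self, cache, durable):
        self.cache = cache      # stand-in for Redis: fast, may lose keys
        self.durable = durable  # stand-in for a table with a unique key

    def claim(self, event_id):
        # Hot path: a cache hit is a known duplicate, no database round trip.
        if event_id in self.cache:
            return False
        # A cache miss proves nothing, so the durable store decides.
        is_new = event_id not in self.durable
        self.durable.add(event_id)
        # Warm the cache so the next duplicate is rejected cheaply.
        self.cache.add(event_id)
        return is_new


cache, durable = set(), set()
dedup = TwoTierDedup(cache, durable)

first = dedup.claim("evt-1")   # new event
second = dedup.claim("evt-1")  # duplicate, rejected by the cache

cache.clear()                  # simulate a failover that drops cache state
third = dedup.claim("evt-1")   # still rejected, this time by the durable store

print(first, second, third)  # True False False
```

The cache only ever says “definitely seen”. Losing it costs you latency, not correctness.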

The Crash-After-Claim Problem

Both patterns, as written above, share a subtle but devastating failure mode: you claim the key, then your process dies before finishing the work. The retry arrives, sees the key as claimed, and skips the event. Unless the claim expires while the broker is still retrying, the event is lost.

The robust fix is to store a status alongside the key:

async def handle_event_with_status(event):
    key = f"idempotency:{event.id}"
    claimed = await redis.set(key, "processing", nx=True, ex=604800)

    if not claimed:
        # Check the status of the existing claim
        # (assumes a client created with decode_responses=True, so GET returns str)
        status = await redis.get(key)
        if status == "done":
            return  # Already processed
        elif status == "processing":
            # Another worker holds the claim, or it crashed mid-processing.
            # Telling these apart requires a more sophisticated mechanism,
            # such as a lease with a heartbeat, so that a stale claim
            # can be detected and re-claimed.
            pass
        return

    # ... process the event, then mark the key as "done"
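A lease is one way to fill in that gap. The sketch below is an illustration under stated assumptions, not the original poster’s implementation: `LeaseStore` is a hypothetical in-memory stand-in for Redis with an injectable clock, `LEASE_SECONDS` is an assumed upper bound on processing time, and heartbeats are left out. The idea is to claim with a short TTL, and only switch to the long TTL once the work is done, so a crashed worker’s claim expires on its own.

```python
import time


class LeaseStore:
    """Hypothetical in-memory store with per-key expiry, standing in for Redis."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.data = {}  # key -> (value, expires_at)

    def _live(self, key):
        entry = self.data.get(key)
        if entry and entry[1] > self.clock():
            return entry
        self.data.pop(key, None)  # expired or missing
        return None

    def set_nx(self, key, value, ttl):
        if self._live(key):
            return False
        self.data[key] = (value, self.clock() + ttl)
        return True

    def set(self, key, value, ttl):
        self.data[key] = (value, self.clock() + ttl)


LEASE_SECONDS = 60         # assumed upper bound on processing time
DONE_TTL_SECONDS = 604800  # 7 days, as in the examples above


def handle(store, event_id, process):
    key = f"idempotency:{event_id}"
    if not store.set_nx(key, "processing", LEASE_SECONDS):
        return "skipped"  # done, or another worker holds a live lease
    process(event_id)
    store.set(key, "done", DONE_TTL_SECONDS)
    return "processed"


def crashing(event_id):
    raise RuntimeError("worker died mid-processing")


now = [0.0]
store = LeaseStore(clock=lambda: now[0])
calls = []

try:
    handle(store, "evt-1", crashing)
except RuntimeError:
    pass  # the key is left in "processing", as after a real crash

r1 = handle(store, "evt-1", calls.append)  # lease still live
now[0] += LEASE_SECONDS + 1                # the stale lease expires
r2 = handle(store, "evt-1", calls.append)  # re-claimed and processed
r3 = handle(store, "evt-1", calls.append)  # already done

print(r1, r2, r3, calls)  # skipped processed skipped ['evt-1']
```

Two caveats. A handler that can outlive the lease needs to extend it periodically (the heartbeat), or a second worker may start the same event. And a delivery skipped because the lease is still live should not be ACKed, otherwise the broker has no reason to redeliver it after the lease expires.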

With PostgreSQL, you can sidestep the problem entirely by putting the claim and the side effects in the same transaction. If the processing fails, the transaction rolls back, and the claim disappears too. This is the strongest guarantee available, but it limits you to single-database transactions.
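Here is a minimal sketch of that same-transaction approach. It again uses an in-memory SQLite database in place of PostgreSQL so it runs as-is. The table layout, the `handle_event` helper, and the `fail` flag used to simulate a crash are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE idempotency_keys (event_id TEXT PRIMARY KEY);
    CREATE TABLE notifications (
        id      INTEGER PRIMARY KEY,
        user_id TEXT NOT NULL,
        message TEXT NOT NULL
    );
    """
)


def handle_event(conn, event_id, user_id, message, fail=False):
    """Claim the event and write its side effect in one transaction."""
    with conn:  # commits on success, rolls back if the block raises
        cur = conn.execute(
            "INSERT INTO idempotency_keys (event_id) VALUES (?) "
            "ON CONFLICT (event_id) DO NOTHING",
            (event_id,),
        )
        if cur.rowcount == 0:
            return False  # duplicate: an earlier attempt already committed
        if fail:
            raise RuntimeError("simulated crash after the claim")
        conn.execute(
            "INSERT INTO notifications (user_id, message) VALUES (?, ?)",
            (user_id, message),
        )
        return True


try:
    handle_event(conn, "evt-1", "u1", "Welcome!", fail=True)
except RuntimeError:
    pass  # the claim was rolled back together with the failed work

retry = handle_event(conn, "evt-1", "u1", "Welcome!")
dup = handle_event(conn, "evt-1", "u1", "Welcome!")
count = conn.execute("SELECT COUNT(*) FROM notifications").fetchone()[0]

print(retry, dup, count)  # True False 1
```

The failed attempt leaves no trace, so the retry is treated as new, and the duplicate after that is rejected. No status column, no lease, no heartbeat.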

Beyond the Door Check: Making Side Effects Idempotent

The developer on Reddit concluded with a crucial insight: “The worker can help reduce duplicate processing, but the strongest guarantee comes from making the domain operation itself idempotent.”

This is the difference between convenience and correctness. A dedup check at the door handles the vast majority of duplicates cheaply. But idempotent side effects catch whatever slips through: an expired TTL, a lost Redis key, a replay from six months later.

Upserts instead of plain INSERTs (this relies on a unique constraint on event_id):

-- Instead of:
INSERT INTO notifications (user_id, message, event_id)
VALUES ($1, $2, $3);

-- Use:
INSERT INTO notifications (user_id, message, event_id)
VALUES ($1, $2, $3)
ON CONFLICT (event_id) DO NOTHING;

Absolute states instead of deltas:

# DON'T: apply a delta; a replayed event applies it a second time
user.balance += amount

# DO: write absolute values taken from the event; a replay rewrites the same values
user.last_transaction_amount = amount
user.transaction_status = 'completed'
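When the thing you care about really is a running total, one option is to key each entry by event ID and derive the total from the entries. This is a sketch of that idea, not something from the original thread. The `Account` class and its method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    balance: int = 0
    applied: dict = field(default_factory=dict)  # event_id -> amount

    def apply_delta(self, amount):
        # Not idempotent: every replay changes the balance again.
        self.balance += amount

    def apply_entry(self, event_id, amount):
        # Idempotent: the entry is keyed by event ID, so a replay overwrites
        # the same entry and the derived balance is unchanged.
        self.applied[event_id] = amount
        self.balance = sum(self.applied.values())


naive, ledger = Account(), Account()
for _ in range(2):  # the same event delivered twice
    naive.apply_delta(50)
    ledger.apply_entry("evt-1", 50)

print(naive.balance, ledger.balance)  # 100 50
```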

Pass idempotency keys downstream: If your event handler calls an external API (say, a push provider such as Firebase Cloud Messaging), forward your event ID as that API’s idempotency key where it supports one, or as the closest equivalent it offers, such as a collapse key. This lets the entire chain deduplicate consistently.
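If one event triggers more than one downstream call, each call needs its own key, and that key has to come out the same on every retry. A name-based UUID is one simple way to get that. In this sketch the namespace value, the `downstream_key` helper, and the action names are all assumptions.

```python
import uuid

# Hypothetical namespace for this service; any fixed UUID works.
NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def downstream_key(event_id: str, action: str) -> str:
    """Derive a stable key for one downstream call triggered by one event."""
    return str(uuid.uuid5(NAMESPACE, f"{event_id}:{action}"))


push_key = downstream_key("evt-1", "send_push")
push_key_on_retry = downstream_key("evt-1", "send_push")
email_key = downstream_key("evt-1", "send_email")

print(push_key == push_key_on_retry)  # True
print(push_key == email_key)          # False
```

Because the key is derived and not generated, a retry on a different worker, after a restart, produces the same value without any shared state.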

Where This Gets Controversial

Here’s the take that might ruffle some feathers: idempotency at the application level is a necessary evil, but the ultimate solution is infrastructure that provides exactly-once semantics.

For years, we’ve been told that at-least-once is the pragmatic choice and exactly-once is either impossible or too expensive. But that’s changing. Google Cloud Pub/Sub now offers exactly-once delivery for pull subscriptions. Kafka’s exactly-once semantics (EOS) have been production-ready for years, at least for pipelines that read from Kafka and write back to Kafka. The message broker landscape is shifting.

The comment thread on that Reddit post captures the tension perfectly:

One commenter asked: “Are you sending events or messages?” The distinction matters because events are facts about what happened and can’t be undone, while messages are commands that change state.

When you’re reacting to domain events like UserRegistered or GoalCompleted, processing the same event twice means double notifications, double analytics events, double everything. The event itself is immutable truth. The problem is how many times you act on it.

This is precisely why the outbox pattern has become so popular. By writing events to a database table in the same transaction as your business data, and then using Change Data Capture (CDC) to stream those events to a message broker, you get the atomicity of a database transaction with the scalability of a streaming platform. The outbox row gives each event a stable ID to dedupe on, and CDC ensures every committed event reaches the broker. Note that most CDC pipelines publish at least once, not exactly once, so consumers still need the dedup check.
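A stripped-down version of the outbox idea might look like the following. Several things here are assumptions for the sake of a runnable sketch: SQLite stands in for the application database, a polling `relay` function stands in for real CDC, and the table and function names are made up.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id TEXT PRIMARY KEY, email TEXT NOT NULL);
    CREATE TABLE outbox (
        event_id   TEXT PRIMARY KEY,
        event_type TEXT NOT NULL,
        payload    TEXT NOT NULL,
        published  INTEGER NOT NULL DEFAULT 0
    );
    """
)


def register_user(conn, user_id, email, event_id):
    # The business row and the event row commit together or not at all.
    with conn:
        conn.execute("INSERT INTO users (id, email) VALUES (?, ?)", (user_id, email))
        conn.execute(
            "INSERT INTO outbox (event_id, event_type, payload) VALUES (?, ?, ?)",
            (event_id, "UserRegistered", json.dumps({"user_id": user_id})),
        )


def relay(conn, publish):
    """Polling stand-in for CDC: publish unpublished rows, then mark them."""
    rows = conn.execute(
        "SELECT event_id, event_type, payload FROM outbox WHERE published = 0"
    ).fetchall()
    for event_id, event_type, payload in rows:
        publish(event_id, event_type, payload)
        # A crash between publish and this UPDATE means the row is sent again
        # on the next pass, which is why consumers dedupe on event_id.
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE event_id = ?", (event_id,))


sent = []
register_user(conn, "u1", "a@example.com", "evt-1")
relay(conn, lambda *event: sent.append(event))
relay(conn, lambda *event: sent.append(event))  # nothing left to publish

print(len(sent))  # 1
```

What the outbox buys you is on the producer side: there is no window in which the user exists but the event doesn’t, or the other way around.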

Practical Rules for Production

After building and debugging these systems, here’s what actually matters:

Deploy idempotency at the application level. Don’t trust your broker’s exactly-once guarantees alone. Infrastructure changes. Bugs happen. Your application should handle duplicates regardless.
Use the provider’s event ID as your key. Stripe’s evt_, Shopify’s X-Shopify-Webhook-Id, and GitHub’s X-GitHub-Delivery GUID are all stable across retries. Scope the key by source: stripe:evt_1OxYzA... rather than bare IDs.
Set TTLs aligned with your retry window. Stripe retries for up to 3 days. Shopify for 48 hours. Set your TTL to the retry window plus a safety margin. Don’t store dedup state forever.
Return 200 for duplicates. Duplicates are a success from the provider’s perspective. Returning an error just schedules more retries.
Test it. As one guide on webhook deduplication puts it: “Capture a real event, replay it five times, and verify one set of side effects and five green responses. Then change the event ID and confirm it processes as new.”
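Several of these rules fit in a few lines. The sketch below scopes keys by source, derives the TTL from the retry windows quoted above plus an assumed one-day margin, and answers 200 for duplicates. The dict-based `seen` store, the `handle_webhook` signature, and the `evt_test_*` IDs are hypothetical, and expiry is not actually enforced here.

```python
RETRY_WINDOW_SECONDS = {
    "stripe": 3 * 24 * 3600,  # retry window quoted above
    "shopify": 48 * 3600,     # retry window quoted above
}
SAFETY_MARGIN_SECONDS = 24 * 3600  # assumed margin; tune for your system


def dedup_key(source: str, event_id: str) -> str:
    # Scope by source so IDs from different providers cannot collide.
    return f"{source}:{event_id}"


def dedup_ttl(source: str) -> int:
    return RETRY_WINDOW_SECONDS[source] + SAFETY_MARGIN_SECONDS


def handle_webhook(seen: dict, source: str, event_id: str, process) -> int:
    """Return the HTTP status to send back to the provider."""
    key = dedup_key(source, event_id)
    if key in seen:
        return 200  # duplicate: acknowledge so the provider stops retrying
    seen[key] = dedup_ttl(source)  # a real store would expire the key after this TTL
    process(event_id)
    return 200


seen, processed = {}, []

# Replay the same event five times, then send one with a new ID.
statuses = [handle_webhook(seen, "stripe", "evt_test_1", processed.append) for _ in range(5)]
statuses.append(handle_webhook(seen, "stripe", "evt_test_2", processed.append))

print(statuses)   # [200, 200, 200, 200, 200, 200]
print(processed)  # ['evt_test_1', 'evt_test_2']
```

That is the replay test from the last rule in miniature: five deliveries, one set of side effects, all green, and a changed ID processed as new.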

The Hard Truth

Event-driven architectures are complex. The developer who started this thread built a solid system with Redis Streams, idempotency keys, and a DLQ. But every system has weak links. Redis’s availability. The gap between an INSERT and an ACK. The race between two workers claiming the same event.

The teams that survive production incidents aren’t the ones with perfect infrastructure. They’re the ones that assumed failure and built idempotency into every layer.

Your worker will crash. Your network will glitch. Your consumer group will rebalance. Your events will be duplicated. The question is whether your system will handle it gracefully, or whether you’ll be explaining to your users why they received seventeen notifications for one registration.

The choice is yours. The duplicates are guaranteed.