Idempotency is a Retry Cache, Not a Guarantee

When your second request carries a different meaning, your ‘solved’ problem explodes. A deep dive into metadata changes, stateful failures, and the illusion of safety.

Idempotency seems simple. Echo the mantra: “Put an Idempotency-Key on the request. Store the response. Replay it on retry.” This gets you through the demo and lulls you into a false sense of security. But the “solved” part, the simple replay of a completed request, isn’t the hard part. The hard part begins the moment you realize the second request isn’t a simple replay. It’s an impostor in a familiar key’s clothing.

The hidden complexity lies in all the cases a naive replay cache can’t explain: concurrent retries, partial local success, external side-effects in an unknown state, and the most insidious of all, the same idempotency key arriving with a different canonical command. Your server’s response in that moment isn’t just about preventing duplicates, it’s a policy decision that defines your system’s integrity. Let’s strip back the platitudes and look at what actually fails.

The Illusion of the Simple Key

Consider this standard POST /payments flow:

POST /payments
Idempotency-Key: abc-123
Content-Type: application/json

{
  "accountId": "acc_1",
  "amount": "10.00",
  "currency": "EUR",
  "merchantReference": "invoice-7781"
}

Your server checks its store for abc-123. Nothing? Great. Create payment pay_789, store a success response, commit. A retry arrives, finds the key, replays pay_789. Simple.

Now watch the illusion shatter. A second request arrives:

POST /payments
Idempotency-Key: abc-123
Content-Type: application/json

{
  "accountId": "acc_1",
  "amount": "100.00",  // Different amount.
  "currency": "EUR",
  "merchantReference": "invoice-7781"
}

Same key. Different command. Is this a buggy client retrying a changed request? Is it a new, distinct payment that mistakenly reused a key? Your server has to decide, and its decision is a contract.

The naive replay strategy would silently return the stored response for the 10 EUR payment. The client asked for 100 EUR and got a confirmation for 10 EUR. That’s not idempotency, that’s a silent, catastrophic reinterpretation.
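To make that failure concrete, here is a minimal sketch of the naive cache in Python. The in-memory dict stands in for a durable store, and the handler and field names are illustrative, not from any real framework:

```python
# Naive replay cache: keys only on the Idempotency-Key, ignores the body.
store = {}  # idempotency_key -> stored response

def naive_handle(key, body):
    if key in store:
        return store[key]  # replay, no questions asked
    response = {"paymentId": "pay_789", "amount": body["amount"]}
    store[key] = response
    return response

naive_handle("abc-123", {"amount": "10.00"})            # first request: 10 EUR
second = naive_handle("abc-123", {"amount": "100.00"})  # "retry" with 100 EUR
assert second["amount"] == "10.00"  # confirmation for the wrong command
```

The cache never looks at the body, so it cannot tell a retry from an impostor.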

State is a Liar: The Gaps Between Your Guarantees

Idempotency is about the effect, not the request. HTTP semantics give you a starting point (PUT and DELETE are idempotent, POST is not), but your handler can still produce duplicate side effects: double audit logs, double emails, double charges from a payment provider.

A unique constraint like (account_id, merchant_reference) can prevent duplicate database rows, but it doesn’t give the client a correct result on retry. If the first request succeeded but the response was lost, a generic 500 on retry leaves the client in the dark. If the row exists but downstream events were published twice, your operation is not idempotent in any business-meaningful way.

For an operation to be truly idempotent, your server needs durable memory that answers three questions:
1. Who owns this key? (Scope: tenant, operation, key)
2. What did the first command mean? (A hash of the validated, canonicalized request)
3. What outcome can be replayed? (The final state: success, replayable failure, or “unknown”)

A minimal idempotency record in PostgreSQL might look like this:

create table idempotency_requests
(
    tenant_id       text        not null,
    operation_name  text        not null,
    idempotency_key text        not null,
    request_hash    text        not null, -- Crucial.
    status          text        not null,
    response_status int,
    response_body   jsonb,
    resource_type   text,
    resource_id     text,
    created_at      timestamptz not null,
    updated_at      timestamptz not null,
    expires_at      timestamptz not null,
    locked_until    timestamptz,
    primary key (tenant_id, operation_name, idempotency_key)
);

The request_hash is your memory of intent. Without it, you cannot distinguish a retry from a new operation. This is where many systems crumble. Hashing the raw JSON bytes isn’t enough, you must hash the validated command. Field order, whitespace, and defaults (channel: "web") must be normalized. Unknown fields that your API ignores? Decide if they are part of the hash. This fingerprint is a contract. Change it during a deploy, and old retries become new operations.
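A canonicalization sketch, assuming illustrative defaults and ignored fields (the field names here are assumptions, not from any real API):

```python
import hashlib
import json

DEFAULTS = {"channel": "web"}                   # applied before hashing
IGNORED = {"requestTimestamp", "traceId"}       # transport metadata, not intent

def canonical_hash(payload: dict) -> str:
    # Drop transport metadata, apply defaults, then serialize with
    # sorted keys and fixed separators so byte layout is deterministic.
    command = {k: v for k, v in payload.items() if k not in IGNORED}
    for field, default in DEFAULTS.items():
        command.setdefault(field, default)
    canonical = json.dumps(command, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

# Field order and explicit defaults no longer change the fingerprint:
a = canonical_hash({"amount": "10.00", "accountId": "acc_1"})
b = canonical_hash({"accountId": "acc_1", "amount": "10.00", "channel": "web"})
assert a == b
# But a different amount does:
c = canonical_hash({"accountId": "acc_1", "amount": "100.00"})
assert c != a
```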

The Concurrency Trap and the State Machine You Didn’t Know You Needed

Two identical requests hit two API instances at nearly the same time. Both find no existing record. Both execute. A SELECT-then-INSERT pattern is broken, even if every single-threaded test passes. You need an atomic insert with a unique constraint, or a SET NX EX in Redis, to acquire ownership.
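A sketch of atomic acquisition, using SQLite's INSERT OR IGNORE as a stand-in for Postgres's INSERT ... ON CONFLICT DO NOTHING; the schema is a trimmed version of the table above:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""create table idempotency_requests (
    tenant_id text, operation_name text, idempotency_key text,
    status text not null,
    primary key (tenant_id, operation_name, idempotency_key))""")

def try_acquire(tenant: str, operation: str, key: str) -> bool:
    # The unique constraint, not a prior SELECT, decides ownership.
    cur = db.execute(
        "insert or ignore into idempotency_requests values (?, ?, ?, 'IN_PROGRESS')",
        (tenant, operation, key))
    db.commit()
    return cur.rowcount == 1  # 1 row modified means we inserted and own the key

assert try_acquire("t1", "create_payment", "abc-123") is True
assert try_acquire("t1", "create_payment", "abc-123") is False  # loser backs off
```

The losing request must then consult the stored record's state, not blindly execute.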

But an atomic lock is just the start. What is the lifecycle of an operation?

| State | Meaning | Retry Behavior |
| --- | --- | --- |
| IN_PROGRESS (fresh) | First request is executing. | Return 202 Accepted or 409 Conflict with Retry-After. |
| IN_PROGRESS (stale) | First request crashed or timed out. | Attempt recovery, do NOT start new execution. |
| COMPLETED | Success. | Replay stored response. |
| FAILED_REPLAYABLE | Business rejection (e.g., INSUFFICIENT_FUNDS). | Replay the failure. |
| FAILED_RETRYABLE | Transient error. | Allow a new attempt (maybe with a new key). |
| UNKNOWN_REQUIRES_RECOVERY | Downstream state unknown. | Trigger reconciliation, return a pending status. |
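The table above can be sketched as a dispatch function. Status names mirror the table; the record shape and lock-freshness rule are illustrative:

```python
from datetime import datetime, timedelta, timezone

def on_retry(record, now=None):
    now = now or datetime.now(timezone.utc)
    status = record["status"]
    if status in ("COMPLETED", "FAILED_REPLAYABLE"):
        # Terminal outcomes are replayed verbatim, success or not.
        return ("replay", record["response_status"], record["response_body"])
    if status == "IN_PROGRESS":
        if record["locked_until"] and record["locked_until"] > now:
            return ("wait", 409, {"error": "in_progress", "retry_after": 5})
        return ("recover", None, None)  # stale lock: reconcile, never re-execute
    if status == "FAILED_RETRYABLE":
        return ("retry", None, None)    # transient failure: a new attempt is safe
    return ("recover", None, None)      # UNKNOWN_REQUIRES_RECOVERY

fresh = {"status": "IN_PROGRESS",
         "locked_until": datetime.now(timezone.utc) + timedelta(seconds=30)}
assert on_retry(fresh)[0] == "wait"
stale = {"status": "IN_PROGRESS", "locked_until": None}
assert on_retry(stale)[0] == "recover"
```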

This state machine becomes essential when side-effects cross boundaries. The classic failure path isn’t exotic:
1. API receives POST /payments.
2. Inserts idempotency row as IN_PROGRESS.
3. Creates local payment pay_789.
4. Calls payment provider, which succeeds.
5. API times out or crashes before recording the provider’s success.
6. Client retries with the same key.

Your database shows IN_PROGRESS. Your idempotency layer must not call the provider with a new payment ID. It must use the stable operation identity (provider_payment_pay_789) to query the provider and reconcile. If the provider has no query API, your idempotency guarantee ends at your system’s boundary. You’ve prevented a duplicate local row, but you may have moved money twice.
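A reconciliation sketch for this path, assuming a hypothetical provider query API (`provider.lookup`) keyed by the stable operation identity sent on the first attempt:

```python
def reconcile(record, provider):
    # Query by the stable identity we sent originally; never issue a new charge.
    op_id = f"provider_payment_{record['resource_id']}"
    remote = provider.lookup(op_id)
    if remote is None:
        return "FAILED_RETRYABLE"       # provider never saw it; safe to retry
    if remote["state"] == "succeeded":
        return "COMPLETED"              # money moved exactly once; store and replay
    return "UNKNOWN_REQUIRES_RECOVERY"  # still settling; stay pending

class FakeProvider:
    # Stand-in for the provider's query API, for illustration only.
    def lookup(self, op_id):
        return {"state": "succeeded"} if op_id == "provider_payment_pay_789" else None

assert reconcile({"resource_id": "pay_789"}, FakeProvider()) == "COMPLETED"
```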

This is the heart of ensuring reliability in event-driven microservices: the hard part of distributed coordination isn't the happy path, it's managing partial failure across disparate systems.

Beyond HTTP: The Queue Consumer’s Silent Double-Count

HTTP gets the attention because the Idempotency-Key header is visible. But duplicate side effects often happen later, in queue consumers, outbox processors, and notification workers.

Your payment service publishes:


{
  "eventId": "evt_100",
  "type": "PaymentCreated",
  "paymentId": "pay_789"
}

A consumer receives it twice. Should it send two emails? Create two ledger entries? The deduplication key might be the eventId, the paymentId, or a business key like ledger_payment_pay_789. Marking a message as processed before performing the side-effect risks losing it forever if you crash. Performing the side-effect before marking risks a duplicate on retry.

The correct pattern is to make the side-effect itself idempotent or durable first: insert a ledger row with a unique constraint on (ledger_entry_type, source_payment_id). The second attempt violates the constraint and is ignored. This moves the idempotency responsibility to the business operation, not just the message consumption, a principle critical for maintaining transactional integrity in aggregates.
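A consumer-side sketch of that pattern, with SQLite standing in for the real database; the duplicate delivery is absorbed by the unique constraint itself:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""create table ledger_entries (
    ledger_entry_type text, source_payment_id text, amount text,
    unique (ledger_entry_type, source_payment_id))""")

def on_payment_created(event):
    try:
        db.execute("insert into ledger_entries values ('payment', ?, ?)",
                   (event["paymentId"], event.get("amount", "0")))
        db.commit()
        return "recorded"
    except sqlite3.IntegrityError:
        return "duplicate_ignored"  # second delivery is absorbed here

evt = {"eventId": "evt_100", "type": "PaymentCreated", "paymentId": "pay_789"}
assert on_payment_created(evt) == "recorded"
assert on_payment_created(evt) == "duplicate_ignored"
```

Note that the dedup key here is the business identity, not the `eventId`: a re-published event with a fresh ID would still be absorbed.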

Real-World Blind Spots and Production Lessons

The theory meets brutal reality in systems like the WooCommerce Pixel Manager, which faces the idempotency problem across multiple external platforms. As detailed in a production-grade analysis, most plugins use a single boolean meta flag: _purchase_event_fired = 1. This works until you enable a second ad platform (Facebook, TikTok, Pinterest). The first platform to fire its event sets the flag, locking out all others. Their solution? Per-platform idempotency keys: _pmw_facebook_purchase_hit, _pmw_tiktok_purchase_hit. Each platform’s pipeline succeeds, fails, and retries independently. A Facebook outage doesn’t silently break TikTok reporting.

This pattern highlights a critical rule: Scope your keys. A broken client generating key abc-123 should only collide with its own operations, not another tenant’s. Your key should be scoped to (tenant_id, operation_name, idempotency_key).

Furthermore, as emphasized in modern backend curricula like the Spring Boot 0 to 100 course, idempotency is a core distributed systems pattern, often implemented using Redis SETNX (SET if Not eXists) for atomic ownership. But as we’ve seen, a Redis lock is merely an execution guard. It is not durable memory of the operation’s outcome. If the lock expires while the provider call is in flight, or the process dies after the provider succeeds but before storing the result, the system has no memory of what happened. Redis cannot tell you if money moved.

Building a System That Remembers

So, what does a robust idempotency layer require?

  1. Reject Key Reuse with Different Content: Same scoped key, different canonical command? This should be a hard 409 Conflict error. It catches client bugs early.
  2. Atomic Ownership Acquisition: Use a unique constraint or atomic INSERT ... ON CONFLICT DO NOTHING. The first insert wins.
  3. Hash the Validated Command, Not Raw Bytes: Normalize defaults, ignore transport metadata, and establish a stable canonical form.
  4. Model a State Machine: Distinguish IN_PROGRESS, COMPLETED, FAILED_REPLAYABLE, FAILED_RETRYABLE, and UNKNOWN.
  5. Plan for External Side-Effects: Use stable downstream operation IDs (provider_payment_pay_789) and have a reconciliation path for UNKNOWN states.
  6. Define Your Replay Contract: Will you replay the exact original response, or return the current state of the resource? Both are valid, pick one and document it. Schema changes (v2 vs v3 responses) make this critical.
  7. Set an Expiry Policy: Idempotency records can’t live forever. A 24-hour window is a product decision. Cleanup must handle stale IN_PROGRESS records carefully, deleting them can cause duplicates.
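Rules 1 and 2 above can be sketched together; an in-memory dict stands in for the idempotency table, and the names are illustrative:

```python
import hashlib
import json

records = {}  # (scope, key) -> {"hash": ..., "status": ...}

def begin(scope, key, payload):
    h = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()).hexdigest()
    existing = records.get((scope, key))
    if existing is None:
        # First insert wins ownership (atomic in the real table).
        records[(scope, key)] = {"hash": h, "status": "IN_PROGRESS"}
        return "execute"
    if existing["hash"] != h:
        return "409_conflict"       # same key, different command: client bug
    return existing["status"]       # replay or wait per the state machine

body = {"accountId": "acc_1", "amount": "10.00"}
assert begin(("t1", "create_payment"), "abc-123", body) == "execute"
assert begin(("t1", "create_payment"), "abc-123",
             {"accountId": "acc_1", "amount": "100.00"}) == "409_conflict"
```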

When Not to Build This Beast

Not every operation needs this machinery. For a read-only endpoint or an admin action where a duplicate is harmless and visible, a simple unique constraint on a business key might suffice. If duplicates are rare, easy to clean up, and don’t move money or notify humans, you might opt for a simpler mechanism. The cost is the durable memory and recovery complexity, not the header.

The second request is not a repeat until proven. The key is not the guarantee. The guarantee is that your server remembers the first operation precisely enough to replay it, reject a mismatch, or recover, instead of guessing. When that guarantee holds, you can start to manage the chaos of mitigating retry storms in distributed services and move towards a system that is not just available, but correct. Anything less is just a replay cache, waiting for the moment the metadata changes.
