Task.WhenAll Is a Footgun for Distributed Systems

Why commonly used async patterns like Task.WhenAll fail in scenarios requiring resilience and partial results, especially when integrating slow or unreliable third-party providers.

by Andre Banandre

Async/await promised to make concurrent programming intuitive. But in high-latency, multi-source integrations, it becomes a liability. The moment you call Task.WhenAll to parallelize requests across unreliable third-party providers, you’ve signed up for brittle failure modes and angry users.

The problem isn’t async itself. It’s that most developers treat it as a silver bullet for performance while ignoring its catastrophic failure modes in distributed environments.

The All-or-Nothing Lie of Task.WhenAll

Consider a flight search system that calls multiple airline APIs simultaneously. The intuitive approach looks like this:

var tasks = providers.Select(p => p.SearchFlightsAsync(query)).ToList();
var results = await Task.WhenAll(tasks);
return results.SelectMany(r => r.Flights);

This code fails in production for three critical reasons:

  1. It waits for the slowest provider – If one airline API takes 30 seconds, your user waits 30 seconds even if three other providers returned results in 2 seconds.
  2. It throws away partial results – If one provider times out, you get zero results instead of the valid responses from the other three.
  3. It provides no failure isolation – A single misbehaving provider can starve your connection pool and crash your service.

A developer on a software architecture forum recently shared their flight search war story, demonstrating exactly this pattern. Their initial implementation using Task.WhenAll blocked on the slowest provider and provided no graceful degradation when individual services failed. The solution required abandoning simple async patterns for a more sophisticated architecture combining scatter-gather for parallel calls, publish-subscribe for decoupling, correlation IDs for request tracking, and an aggregator pattern for merging partial responses safely.

The Eight Failure Modes of Network Requests

The AWS Builders’ Library article on distributed systems challenges reveals why simple async patterns break down. Every network request involves eight distinct steps that can fail independently:

  1. POST REQUEST: Client puts message onto network
  2. DELIVER REQUEST: Network delivers to server
  3. VALIDATE REQUEST: Server validates message
  4. UPDATE SERVER STATE: Server processes the request
  5. POST REPLY: Server puts reply onto network
  6. DELIVER REPLY: Network delivers to client
  7. VALIDATE REPLY: Client validates response
  8. UPDATE CLIENT STATE: Client processes the result

These aren’t theoretical edge cases; they’re the fundamental physics of distributed systems. Most developers write code for the happy path and ignore the ways each of these eight steps can fail. When you wrap these unreliable steps in Task.WhenAll, you’re multiplying your exposure to failure rather than containing it.

The article explains that in hard real-time distributed systems, a single method call like board.find("pacman") explodes into fifteen distinct failure scenarios across client and server. Your tidy async method becomes a distributed systems nightmare.

Where Task.WhenAll Completely Fails: Partial Results

The fatal flaw of Task.WhenAll is its binary nature: either all tasks succeed, or you get an exception. In multi-provider integrations, this is precisely the wrong model.

Imagine three API calls:
– Provider A: Returns 50 flights in 1.5 seconds
– Provider B: Returns 30 flights in 2 seconds
– Provider C: Times out after 10 seconds

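Awaiting Task.WhenAll rethrows the first failure (the complete set is available as an AggregateException on the returned task’s Exception property), and the naive code shown earlier discards 80 valid flights. Users don’t care that one provider failed; they care that they can’t book their trip.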
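The scenario can be reproduced with simulated providers. This is a minimal sketch, not code from the flight search system: `WhenAllDemo` and `FakeProviderAsync` are hypothetical names, and the delays are scaled down from seconds to milliseconds so it runs quickly.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class WhenAllDemo
{
    // Hypothetical stand-in for a provider call: waits, then either returns
    // `count` flight IDs or throws a TimeoutException.
    public static async Task<string[]> FakeProviderAsync(
        string name, int delayMs, int count, bool timesOut)
    {
        await Task.Delay(delayMs);
        if (timesOut) throw new TimeoutException($"{name} timed out");
        return Enumerable.Range(1, count).Select(i => $"{name}-{i}").ToArray();
    }

    // Returns the number of flights the naive Task.WhenAll approach delivers.
    public static async Task<int> NaiveSearchAsync()
    {
        var tasks = new[]
        {
            FakeProviderAsync("A", 15, 50, timesOut: false),
            FakeProviderAsync("B", 20, 30, timesOut: false),
            FakeProviderAsync("C", 100, 0, timesOut: true),
        };

        try
        {
            var results = await Task.WhenAll(tasks);
            return results.Sum(r => r.Length);
        }
        catch (TimeoutException)
        {
            // The await rethrows provider C's exception. Providers A and B
            // completed successfully, but this code path never reads them.
            return 0;
        }
    }
}
```

Calling `await WhenAllDemo.NaiveSearchAsync()` yields 0 flights even though two of the three simulated providers returned 80 between them.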

The flight search system that actually worked in production took a different approach. It implemented an aggregator pattern that collected results as they arrived, with configurable timeouts per provider. The UI showed progressive results, updating in real-time as providers responded. This required abandoning Task.WhenAll for manual task management with Task.WhenAny and sophisticated cancellation logic.
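One way to collect results as they arrive is a Task.WhenAny loop with a cancellation timeout per provider. The sketch below is an illustration under assumptions, not the production system's code: `ScatterGather`, `GatherAsync`, and the `onResult` callback are hypothetical names, and it assumes every provider call is asynchronous and honors the CancellationToken it receives.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class ScatterGather
{
    // Starts every provider call with its own timeout and passes each result
    // to onResult as soon as it arrives. Returns the names of the providers
    // that failed or timed out.
    public static async Task<List<string>> GatherAsync<T>(
        IReadOnlyDictionary<string, Func<CancellationToken, Task<T>>> providers,
        TimeSpan perProviderTimeout,
        Action<string, T> onResult)
    {
        var pending = new Dictionary<Task<T>, string>();
        var sources = new List<CancellationTokenSource>();
        foreach (var (providerName, call) in providers)
        {
            // Each provider gets an independent timeout (assumption: the
            // call observes the token and cancels itself).
            var cts = new CancellationTokenSource(perProviderTimeout);
            sources.Add(cts);
            pending[call(cts.Token)] = providerName;
        }

        var failed = new List<string>();
        while (pending.Count > 0)
        {
            // Wake up for whichever provider finishes next, in any order.
            var done = await Task.WhenAny(pending.Keys);
            var name = pending[done];
            pending.Remove(done);
            try
            {
                onResult(name, await done);
            }
            catch (Exception)
            {
                // One provider's failure is recorded, not propagated.
                failed.Add(name);
            }
        }

        sources.ForEach(s => s.Dispose());
        return failed;
    }
}
```

A caller would pass a dictionary of provider delegates and a callback that pushes each batch to the UI or an aggregator. A slow provider then costs at most `perProviderTimeout`, and the other providers' results are already delivered by the time it fails.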

Error Handling That Disappears in Async Boundaries

Structured error handling in distributed systems requires intentional design. The DEV Community article on real-world error handling demonstrates that throwing exceptions across async boundaries destroys context. By the time an exception bubbles up through Task.WhenAll, you’ve lost critical diagnostic information.

Instead, return structured error objects:

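public async Task<SearchResult> SearchFlightsAsync(string query)
{
    try
    {
        var flights = await _provider.SearchAsync(query);
        return SearchResult.Success(flights);
    }
    catch (TimeoutException ex)
    {
        // A timeout is reported as a value, not rethrown, so the aggregator
        // can still use the other providers' results.
        // Note: HttpClient timeouts surface as TaskCanceledException, so a
        // provider built on HttpClient needs that case mapped as well.
        return SearchResult.Partial("Provider timeout", ex);
    }
    catch (Exception ex)
    {
        // Any other failure keeps its exception attached for diagnostics.
        return SearchResult.Failure("Provider error", ex);
    }
}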

This pattern preserves failure context and allows the aggregator to make intelligent decisions about partial results. Combined with correlation IDs, you can trace a single user request across dozens of provider calls and asynchronous boundaries.
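The article does not show the `SearchResult` type itself. A minimal sketch of what it might look like follows; the `SearchStatus` enum, the `CorrelationId` and `IsUsable` members, and the use of strings for flights are all assumptions made for illustration.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

public enum SearchStatus { Success, Partial, Failure }

// Hypothetical shape for the SearchResult used above: a status, the flights
// (possibly empty), and the failure context that an exception would carry.
public sealed record SearchResult(
    SearchStatus Status,
    IReadOnlyList<string> Flights,
    string? Message = null,
    Exception? Error = null,
    string? CorrelationId = null)
{
    public static SearchResult Success(IReadOnlyList<string> flights) =>
        new(SearchStatus.Success, flights);

    public static SearchResult Partial(string message, Exception error) =>
        new(SearchStatus.Partial, Array.Empty<string>(), message, error);

    public static SearchResult Failure(string message, Exception error) =>
        new(SearchStatus.Failure, Array.Empty<string>(), message, error);

    // The aggregator can merge anything that is not an outright failure.
    public bool IsUsable => Status != SearchStatus.Failure;
}
```

Because the type is a record, a correlation ID can be attached without mutation, for example `result with { CorrelationId = requestId }`, and the aggregator can filter on `IsUsable` instead of catching exceptions.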

Idempotency: The Requirement You Can’t Ignore

When you move beyond Task.WhenAll and implement proper resilience patterns, retries become essential. But retries demand idempotency: every operation must be safe to repeat.

The DEV article emphasizes this isn’t an optimization, it’s a requirement. Without idempotency keys, a retry after a network timeout can double-charge customers or send duplicate bookings. The flight search system implemented idempotency by requiring a client-generated search ID and storing provider responses keyed by that ID, preventing duplicate processing even when retries occurred.

For APIs, this means requiring an Idempotency-Key header. For background jobs, it means storing processed message identifiers in a fast datastore like Redis or DynamoDB. Skip this and your resilience patterns will create worse problems than they solve.
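The key-based deduplication described above can be sketched as follows. `IdempotentExecutor` is a hypothetical name, and the in-memory dictionary is only a stand-in for a shared store such as Redis or DynamoDB; a real deployment would also need expiry and cross-process coordination, which this sketch leaves out.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// In-memory illustration of idempotency keys: the operation behind a key
// runs at most once, and every retry with that key gets the stored outcome.
public sealed class IdempotentExecutor<TResult>
{
    private readonly ConcurrentDictionary<string, Lazy<Task<TResult>>> _seen = new();

    public Task<TResult> ExecuteAsync(string idempotencyKey, Func<Task<TResult>> operation) =>
        // GetOrAdd may build more than one Lazy under contention, but only
        // the stored one is ever evaluated, so `operation` runs once per key.
        _seen.GetOrAdd(idempotencyKey, _ => new Lazy<Task<TResult>>(operation)).Value;
}
```

Note that this version also remembers failed attempts. Whether a retry should re-run a failed operation or replay the stored failure is a design decision the article does not specify.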

The Path Forward: Composable Patterns Over Magic Methods

The flight search system’s final architecture didn’t rely on a single pattern. It combined:

  • Scatter-Gather for parallel provider calls with individual timeouts
  • Correlation IDs to track requests across async boundaries
  • Aggregator to merge partial results safely
  • Circuit Breakers to stop calling failing providers
  • Async Reply over HTTP to return immediate responses
  • Hexagonal Architecture to keep business logic decoupled from integration concerns
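Of the patterns listed, the circuit breaker is the easiest to show in isolation. The sketch below is a deliberately small illustration rather than the system's implementation: `CircuitBreaker`, its threshold and duration parameters, and the injectable clock are all assumptions, and production code would more likely use an established resilience library.

```csharp
#nullable enable
using System;
using System.Threading.Tasks;

// Opens after N consecutive failures, rejects calls while open, and lets a
// trial call through once the open duration has elapsed.
public sealed class CircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _openDuration;
    private readonly Func<DateTimeOffset> _clock;
    private readonly object _gate = new();
    private int _consecutiveFailures;
    private DateTimeOffset _openedAt;

    public CircuitBreaker(int failureThreshold, TimeSpan openDuration,
                          Func<DateTimeOffset>? clock = null)
    {
        _failureThreshold = failureThreshold;
        _openDuration = openDuration;
        _clock = clock ?? (() => DateTimeOffset.UtcNow);
    }

    public bool IsOpen
    {
        get
        {
            lock (_gate)
            {
                return _consecutiveFailures >= _failureThreshold
                    && _clock() - _openedAt < _openDuration;
            }
        }
    }

    public async Task<T> ExecuteAsync<T>(Func<Task<T>> call)
    {
        if (IsOpen)
            throw new InvalidOperationException("Circuit open: provider call skipped");
        try
        {
            var result = await call();
            lock (_gate) { _consecutiveFailures = 0; } // success closes the circuit
            return result;
        }
        catch
        {
            lock (_gate)
            {
                _consecutiveFailures++;
                // Reaching the threshold (or failing a trial call) restarts
                // the open window.
                if (_consecutiveFailures >= _failureThreshold) _openedAt = _clock();
            }
            throw;
        }
    }
}
```

With one breaker instance per provider, a provider that keeps timing out stops receiving calls for the open duration, so it cannot keep tying up connections while the remaining providers continue to serve results.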

This is the uncomfortable truth: Task.WhenAll and similar conveniences work for simple parallelization on a single machine. In distributed systems, they hide complexity without solving it. The solution isn’t more async magic, it’s explicit, composable patterns that acknowledge the eight failure modes of network requests.

Your Async Code Is Lying to You

If your system integrates multiple external providers and you’re using Task.WhenAll, you’re shipping a time bomb. It works in development, passes integration tests, and fails catastrophically under real-world conditions where providers degrade gradually rather than failing cleanly.

The AWS article warns that distributed bugs often lie dormant for months before causing outages. A provider that usually responds in 500ms might suddenly start taking 30 seconds during peak load. Your Task.WhenAll will happily wait, holding connections, memory, and in-flight request state until your service collapses.

Practical Takeaways

  1. Never use Task.WhenAll for external integrations – Use Task.WhenAny with manual result collection and per-provider timeouts
  2. Design for partial results – Your aggregator should accept incomplete responses and still provide value
  3. Implement structured error responses – Don’t throw exceptions across service boundaries
  4. Make idempotency non-negotiable – Every mutating operation must be safe to retry
  5. Correlate everything – Pass correlation IDs through every async boundary
  6. Test failure modes individually – The eight network failure steps each need test coverage

The flight search code that survived production is available for exploration, showing these patterns in action. The difference between development and production isn’t edge cases, it’s that production has users who expect results even when your dependencies fail.

How have you seen async patterns fail in distributed systems? What patterns have saved you during incidents? The community’s collective war stories might prevent the next outage.
