
When DNS Goes Down: How to Weather the Cloud-Storm
Strategies for surviving DNS outages when everything breaks.
The AWS US-East-1 outage on October 20, 2025, wasn’t a server meltdown. It wasn’t a network cable being cut. The root cause was simple: a race condition in AWS’s own DNS automation that removed critical DynamoDB endpoint records. When DNS goes wrong, the internet breaks, even when the underlying services are perfectly healthy.
DNS is the phone book of the internet, translating friendly names like dynamodb.us-east-1.amazonaws.com into machine-readable IP addresses. When that translation fails, applications can’t connect, and nothing works.
The AWS outage demonstrates that even the largest cloud providers aren’t immune to DNS failures. According to AWS’s own Root Cause Analysis, “The root cause of this issue was a latent race condition in the DynamoDB DNS management system that resulted in an incorrect empty DNS record… that the automation failed to repair.”
Why DNS Failures Aren’t Just DNS Problems
DNS issues create cascading failures because every connection starts with a DNS lookup:
- Applications can’t resolve endpoints
- Control planes can’t coordinate
- Health checks fail
- Load balancers remove healthy instances
- Services enter retry storms
The Uptime Institute’s 2022 Outage Analysis Report identified DNS issues as one of the most common causes of network-related outages, on par with configuration errors.

Beyond Multi-AZ: The Multi-Region Safety Net
Most cloud applications deploy across multiple availability zones (AZs) for redundancy. But AZ redundancy protects against single data center failures, not regional DNS issues. When AWS’s DNS automation removed all records for the DynamoDB regional endpoint, multi-AZ deployment provided zero protection.
Services like DynamoDB Global Tables did help: customers could connect to replicas in other regions, though replication lag created data consistency issues. The fundamental problem remained: DNS is typically a single point of failure unless specifically architected otherwise.
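A minimal sketch of that regional fallback, using the AWS SDK for Go v2 (the table name, key, and region order are assumptions for the example, and real code would still have to cope with replication lag):

```go
// Read from a DynamoDB Global Table replica when the primary regional
// endpoint is unreachable. Regions and table name are illustrative.
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

func getItemWithRegionalFallback(ctx context.Context, key map[string]types.AttributeValue) (*dynamodb.GetItemOutput, error) {
	input := &dynamodb.GetItemInput{
		TableName: aws.String("orders"), // hypothetical Global Table
		Key:       key,
	}
	var lastErr error
	for _, region := range []string{"us-east-1", "us-west-2"} { // primary first, then replica
		cfg, err := config.LoadDefaultConfig(ctx, config.WithRegion(region))
		if err != nil {
			lastErr = err
			continue
		}
		out, err := dynamodb.NewFromConfig(cfg).GetItem(ctx, input)
		if err == nil {
			return out, nil
		}
		lastErr = err // regional endpoint unreachable or erroring: try the next region
	}
	return nil, lastErr
}

func main() {
	_, err := getItemWithRegionalFallback(context.Background(),
		map[string]types.AttributeValue{"id": &types.AttributeValueMemberS{Value: "42"}})
	if err != nil {
		log.Fatal(err)
	}
}
```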
Practical DNS Resilience Patterns
Stale Record Caching with Serve-Stale
NodeLocal DNSCache’s serve_stale option lets applications continue using cached DNS records even when upstream DNS servers are unreachable. This RFC 8767 pattern prevents immediate failure when DNS servers are down, though it introduces risks if records change during the outage.
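Serve-stale is configured in the CoreDNS cache plugin that NodeLocal DNSCache is built on. A minimal Corefile fragment might look like the following; the zone, TTL, upstream, and duration are placeholders, and the exact stanza depends on the NodeLocal DNSCache version you run:

```
.:53 {
    errors
    cache 30 {
        # Keep answering from cache for up to an hour after records
        # expire if the upstream resolver cannot be reached (RFC 8767).
        serve_stale 1h
    }
    forward . /etc/resolv.conf
}
```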
Multiple DNS Resolvers with Fallback
Don’t rely on a single DNS provider. Configure multiple resolvers with automatic failover:
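A minimal Go sketch of client-side fallback across resolvers (the resolver addresses are public examples; in practice you might also randomize the order or add health checks):

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// Resolvers to try in order; the addresses here are illustrative.
var resolvers = []string{"1.1.1.1:53", "8.8.8.8:53", "9.9.9.9:53"}

// lookupWithFallback pins each lookup to a specific resolver and
// moves on to the next one if it fails.
func lookupWithFallback(ctx context.Context, host string) ([]string, error) {
	var lastErr error
	for _, server := range resolvers {
		r := &net.Resolver{
			PreferGo: true,
			Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
				d := net.Dialer{Timeout: 2 * time.Second}
				return d.DialContext(ctx, network, server) // force this resolver
			},
		}
		ips, err := r.LookupHost(ctx, host)
		if err == nil && len(ips) > 0 {
			return ips, nil
		}
		lastErr = err // resolver failed: try the next one
	}
	return nil, lastErr
}

func main() {
	ips, err := lookupWithFallback(context.Background(), "example.com")
	fmt.Println(ips, err)
}
```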
Edge Caching Strategies
Use CDNs and edge caches to serve content even when origin DNS fails. CloudFront, Cloudflare, or Akamai can cache responses and continue serving users during DNS outages. Combine this with longer TTLs for critical endpoints, but balance this against the risk of stale records during normal operations.
Operator Patterns for DNS Management
Kubernetes operators provide a declarative approach to managing external dependencies. The ExternalDNS operator continuously reconciles your Kubernetes services with external DNS providers, implementing compare-and-swap semantics that prevent race conditions like the one that triggered the AWS outage.
“The DNS Management System failed because a delayed process overwrote new data”, explains the Kubernetes operator documentation. “In Kubernetes, this is prevented by etcd’s atomic ‘compare-and-swap’ mechanism. Every resource has a resourceVersion. If an Operator tries to update a resource using an old version, the API server rejects the write.”
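To make that concrete, here is a hedged client-go sketch of the optimistic-concurrency pattern the quote describes: an update made against a stale resourceVersion is rejected by the API server, and retry.RetryOnConflict re-reads the object before trying again, so a delayed writer cannot silently overwrite newer data. The namespace, Service name, and ExternalDNS hostname annotation are illustrative.

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/retry"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// Re-fetch and retry whenever the write is rejected as a conflict.
	err = retry.RetryOnConflict(retry.DefaultRetry, func() error {
		svc, err := client.CoreV1().Services("default").Get(ctx, "my-service", metav1.GetOptions{})
		if err != nil {
			return err
		}
		if svc.Annotations == nil {
			svc.Annotations = map[string]string{}
		}
		svc.Annotations["external-dns.alpha.kubernetes.io/hostname"] = "api.example.com"
		_, err = client.CoreV1().Services("default").Update(ctx, svc, metav1.UpdateOptions{})
		return err
	})
	if err != nil {
		panic(err)
	}
}
```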
Hard-Coded Fallbacks: The Last Resort
When all else fails, having hard-coded IP fallbacks can keep critical systems running. This approach requires careful maintenance but provides ultimate resilience:
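A minimal Go sketch of the idea; the hostname and addresses are placeholders that would have to be maintained out of band:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// Hard-coded last-resort addresses, keyed by hostname (hypothetical values).
var fallbackIPs = map[string][]string{
	"api.internal.example.com": {"10.0.12.7", "10.0.44.7"},
}

// resolve tries normal DNS first and only falls back to the static list
// when resolution fails entirely.
func resolve(host string) []string {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	ips, err := net.DefaultResolver.LookupHost(ctx, host)
	if err == nil && len(ips) > 0 {
		return ips
	}
	// DNS is unavailable: use the hard-coded list. A circuit breaker or
	// health check on the returned addresses belongs here.
	return fallbackIPs[host]
}

func main() {
	fmt.Println(resolve("api.internal.example.com"))
}
```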
The trade-off? Hard-coded IPs can point to retired instances during normal operations. Use this pattern only for truly critical paths with circuit breakers to detect when fallbacks become stale.
Monitoring and Testing DNS Resilience
You can’t fix what you don’t measure. Implement synthetic monitoring that tests DNS resolution from multiple geographic locations. Tools like dig and monitoring services can alert when:
- TTL values change unexpectedly
- Resolution times increase
- Record types disappear
- Geographic DNS results diverge
Test failure scenarios regularly: “What happens if Route53 becomes unreachable? How quickly do we detect and failover?”
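As one way to automate such checks, a small probe built on the github.com/miekg/dns library can query the same record through several resolvers and flag missing answers or surprising TTLs. The resolver addresses and probed name below are assumptions; a real monitor would also compare results across locations and alert on divergence.

```go
package main

import (
	"fmt"
	"time"

	"github.com/miekg/dns"
)

func main() {
	resolvers := []string{"1.1.1.1:53", "8.8.8.8:53"}
	name := dns.Fqdn("example.com")

	for _, server := range resolvers {
		m := new(dns.Msg)
		m.SetQuestion(name, dns.TypeA)
		c := &dns.Client{Timeout: 2 * time.Second}

		resp, rtt, err := c.Exchange(m, server)
		if err != nil || resp == nil || len(resp.Answer) == 0 {
			// Record type disappeared or resolver unreachable: alert.
			fmt.Printf("ALERT %s: no A records for %s (err=%v)\n", server, name, err)
			continue
		}
		for _, rr := range resp.Answer {
			if a, ok := rr.(*dns.A); ok {
				fmt.Printf("%s: %s TTL=%ds rtt=%s\n", server, a.A, rr.Header().Ttl, rtt)
			}
		}
	}
}
```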
The Multi-Cloud Safety Net
While avoiding cloud provider lock-in is ideal, the reality is most organizations won’t maintain active-active multi-cloud deployments. More practical: ensure critical dependencies (DNS, identity, secrets) can failover across providers during regional outages.
Services like Auth0 or Okta for identity, Cloudflare for DNS, and HashiCorp Vault for secrets management provide cross-cloud resilience without maintaining full application stacks in multiple clouds.
DNS failures will happen, whether from provider outages, configuration errors, or automation bugs. The AWS outage reminds us that even the most sophisticated systems have single points of failure. Building DNS resilience requires:
- Multiple resolution strategies (caching, fallbacks, operators)
- Regular failure testing
- Cross-provider redundancy for critical dependencies
- Graceful degradation when DNS is unavailable
When DNS breaks, your applications shouldn’t. Plan for the inevitable, because in distributed systems, “it’s always DNS” until you’ve designed around the dependency.



