When DNS Goes Down: How to Weather the Cloud-Storm

Strategies for surviving DNS outages when everything breaks.
October 24, 2025

The AWS US-East-1 outage on October 20, 2025, wasn’t a server meltdown. It wasn’t a network cable being cut. The root cause was simple: a race condition in AWS’s own DNS automation that removed critical DynamoDB endpoint records. When DNS goes wrong, the internet breaks, even when the underlying services are perfectly healthy.

DNS is the phone book of the internet, translating friendly names like dynamodb.us-east-1.amazonaws.com into machine-readable IP addresses. When that translation fails, applications can’t connect, and nothing works.
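
To see what that failure mode looks like from an application's point of view, here is a minimal Python illustration; the hostname is just an example and nothing here is from the original post:

import socket

hostname = "dynamodb.us-east-1.amazonaws.com"

try:
    # Translate the name into (family, type, proto, canonname, sockaddr) tuples
    addresses = socket.getaddrinfo(hostname, 443)
    print(sorted({sockaddr[0] for *_, sockaddr in addresses}))
except socket.gaierror as err:
    # This is what "DNS is down" looks like to an application: the service
    # may be perfectly healthy, but we have no address to connect to.
    print(f"DNS resolution failed: {err}")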

The AWS outage demonstrates that even the largest cloud providers aren’t immune to DNS failures. According to AWS’s own Root Cause Analysis, “The root cause of this issue was a latent race condition in the DynamoDB DNS management system that resulted in an incorrect empty DNS record… that the automation failed to repair.”

Why DNS Failures Aren’t Just DNS Problems

DNS issues create cascading failures because every connection starts with a DNS lookup:

  • Applications can’t resolve endpoints
  • Control planes can’t coordinate
  • Health checks fail
  • Load balancers remove healthy instances
  • Services enter retry storms

The Uptime Institute’s 2022 Outage Analysis Report identified DNS failures as among the most common reasons behind network-related outages, tying with configuration errors.
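
One way to blunt the retry storms listed above is to back off exponentially with jitter instead of retrying in lockstep. A minimal sketch; the function name, thresholds, and exception choice are illustrative, not from the original post:

import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a flaky call with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except OSError:  # covers socket.gaierror and connection errors
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the exponential cap,
            # so thousands of clients don't all retry at the same instant.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            time.sleep(delay)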

[Figure: DNS resolution workflow]

Beyond Multi-AZ: The Multi-Region Safety Net

Most cloud applications deploy across multiple availability zones (AZs) for redundancy. But AZ redundancy protects against single data center failures, not regional DNS issues. When AWS’s DNS automation removed all records for the DynamoDB regional endpoint, multi-AZ deployment provided zero protection.

Services like DynamoDB Global Tables did help: customers could connect to replicas in other regions, though replication lag created data consistency issues. The fundamental problem remained: DNS is typically a single point of failure unless specifically architected otherwise.
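
A client-side counterpart to Global Tables is to fall back to a replica region when the home region’s endpoint won’t resolve. A rough sketch, assuming boto3 and a table replicated in both regions; the region list and table name are placeholders:

import botocore.exceptions
import boto3

REGIONS = ["us-east-1", "us-west-2"]  # home region first, replica second

def get_table(table_name):
    """Return a DynamoDB Table handle, preferring the home region."""
    last_error = None
    for region in REGIONS:
        try:
            table = boto3.resource("dynamodb", region_name=region).Table(table_name)
            table.load()  # forces a DescribeTable call, surfacing DNS/connect errors early
            return table
        except botocore.exceptions.EndpointConnectionError as err:
            last_error = err  # endpoint unreachable (e.g. DNS failure); try the replica
    raise last_error

Keep in mind the consistency caveat above: reads from the replica may be behind the home region.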

Practical DNS Resilience Patterns

Stale Record Caching with Serve-Stale

The serve_stale option in CoreDNS’s cache plugin (used by NodeLocal DNSCache) lets applications continue using cached DNS records even when upstream DNS servers are unreachable. This is the serve-stale pattern from RFC 8767: it prevents immediate failure when resolvers are down, though it introduces risk if records change during the outage.

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache {
            # keep answering from cache for up to 1h if upstream is unreachable
            serve_stale 1h
        }
        loop
        reload
        loadbalance
    }

Multiple DNS Resolvers with Fallback

Don’t rely on a single DNS provider. Configure multiple resolvers with automatic failover:

# /etc/resolv.conf
nameserver 8.8.8.8          # Google DNS
nameserver 1.1.1.1          # Cloudflare
nameserver 169.254.169.253  # AWS Internal DNS
options timeout:1
options attempts:2
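
The same idea can be applied at the application layer. A sketch using the dnspython package (an assumption, not mentioned in the original post) to query a list of resolvers in order:

import dns.exception
import dns.resolver  # pip install dnspython

RESOLVERS = ["8.8.8.8", "1.1.1.1", "169.254.169.253"]

def resolve_a_record(hostname):
    """Try each resolver in turn and return the first set of A records."""
    for server in RESOLVERS:
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [server]
        resolver.timeout = 1   # per-query timeout, seconds
        resolver.lifetime = 2  # total time budget for this resolver
        try:
            answer = resolver.resolve(hostname, "A")
            return [record.address for record in answer]
        except dns.exception.DNSException:
            continue  # this resolver failed; fall through to the next one
    raise RuntimeError(f"All resolvers failed for {hostname}")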

Edge Caching Strategies

Use CDNs and edge caches to serve content even when origin DNS fails. CloudFront, Cloudflare, or Akamai can cache responses and continue serving users during DNS outages. Combine this with longer TTLs for critical endpoints, but balance this against the risk of stale records during normal operations.
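
If you do lengthen TTLs for critical records, that is a one-line change per record set. For example, with boto3 against Route 53; the hosted zone ID, record name, and address below are placeholders:

import boto3

route53 = boto3.client("route53")

# UPSERT the record with a longer TTL so resolvers can serve it from cache
# for longer if the authoritative DNS becomes unreachable.
route53.change_resource_record_sets(
    HostedZoneId="Z0000000EXAMPLE",  # placeholder hosted zone ID
    ChangeBatch={
        "Comment": "Raise TTL on a critical endpoint",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "api.example.com",
                "Type": "A",
                "TTL": 3600,  # trade-off: record changes propagate more slowly
                "ResourceRecords": [{"Value": "203.0.113.10"}],
            },
        }],
    },
)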

Operator Patterns for DNS Management

Kubernetes operators provide a declarative approach to managing external dependencies. The ExternalDNS operator continuously reconciles your Kubernetes services with external DNS providers, implementing compare-and-swap semantics that prevent race conditions like the one that triggered the AWS outage.

“The DNS Management System failed because a delayed process overwrote new data”, explains the Kubernetes operator documentation. “In Kubernetes, this is prevented by etcd’s atomic ‘compare-and-swap’ mechanism. Every resource has a resourceVersion. If an Operator tries to update a resource using an old version, the API server rejects the write.”
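
The quoted mechanism is easy to see against the Kubernetes API directly: reads return a resourceVersion, and a write based on a stale version is rejected with a 409 Conflict. A sketch using the official Python client; the ConfigMap is just the CoreDNS example from above, and the edit is illustrative:

from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()
api = client.CoreV1Api()

# Read the object; the server includes metadata.resource_version in the response.
cm = api.read_namespaced_config_map(name="coredns", namespace="kube-system")
cm.data["Corefile"] = cm.data["Corefile"].replace("serve_stale 1h", "serve_stale 2h")

try:
    # The replace carries the resourceVersion we read. If anything else has
    # written the object since, the API server rejects this write with a
    # 409 Conflict instead of letting a delayed writer overwrite newer data.
    api.replace_namespaced_config_map(name="coredns", namespace="kube-system", body=cm)
except ApiException as err:
    if err.status == 409:
        print("Conflict: re-read the object and retry the update")
    else:
        raise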

Hard-Coded Fallbacks: The Last Resort

When all else fails, having hard-coded IP fallbacks can keep critical systems running. This approach requires careful maintenance but provides ultimate resilience:

import socket

def resolve_with_fallback(hostname, fallback_ips):
    """Resolve hostname via DNS, falling back to hard-coded IPs on failure."""
    try:
        return socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # DNS resolution failed: synthesize addrinfo-style tuples
        # (family, type, proto, canonname, sockaddr) from the fallback IPs.
        return [(socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP, '', (ip, 0))
                for ip in fallback_ips]

The trade-off? Hard-coded IPs can point to retired instances during normal operations. Use this pattern only for truly critical paths with circuit breakers to detect when fallbacks become stale.
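
A lightweight circuit breaker around the fallback path can flag when the hard-coded IPs have gone stale. A minimal sketch; the class name and thresholds are arbitrary choices, not from the original post:

import time

class FallbackCircuitBreaker:
    """Stop using hard-coded fallback IPs after repeated connect failures."""

    def __init__(self, failure_threshold=3, reset_after=300):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds before probing the fallbacks again
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: after the cool-down, allow one probe of the fallback path.
        return time.monotonic() - self.opened_at > self.reset_after

    def record(self, success):
        if success:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                # Fallback IPs look stale; stop using them until the cool-down expires.
                self.opened_at = time.monotonic()

If the breaker opens, the safest response is to fail fast and alert rather than keep connecting to addresses that may now belong to someone else.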

Monitoring and Testing DNS Resilience

You can’t fix what you don’t measure. Implement synthetic monitoring that tests DNS resolution from multiple geographic locations. Tools like dig and monitoring services can alert when:

  • TTL values change unexpectedly
  • Resolution times increase
  • Record types disappear
  • Geographic DNS results diverge

Test failure scenarios regularly: “What happens if Route53 becomes unreachable? How quickly do we detect and failover?”
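
A simple scheduled check can cover several of the alert conditions above. A sketch using dnspython (an assumed dependency) to compare answers and TTLs across resolvers; the hostname and thresholds are placeholders:

import dns.exception
import dns.resolver  # pip install dnspython

RESOLVERS = {"google": "8.8.8.8", "cloudflare": "1.1.1.1"}
HOSTNAME = "api.example.com"
MIN_TTL = 60

def check_dns():
    """Return a list of human-readable alerts for a single hostname."""
    alerts, answers = [], {}
    for name, server in RESOLVERS.items():
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [server]
        try:
            answer = resolver.resolve(HOSTNAME, "A")
            answers[name] = sorted(record.address for record in answer)
            if answer.rrset.ttl < MIN_TTL:
                alerts.append(f"{name}: TTL dropped to {answer.rrset.ttl}s")
        except dns.exception.DNSException as err:
            alerts.append(f"{name}: resolution failed ({err})")
    if len(set(map(tuple, answers.values()))) > 1:
        alerts.append(f"resolvers disagree: {answers}")
    return alerts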

The Multi-Cloud Safety Net

While avoiding cloud provider lock-in is ideal, the reality is most organizations won’t maintain active-active multi-cloud deployments. More practical: ensure critical dependencies (DNS, identity, secrets) can failover across providers during regional outages.

Services like Auth0 or Okta for identity, Cloudflare for DNS, and HashiCorp Vault for secrets management provide cross-cloud resilience without maintaining full application stacks in multiple clouds.


DNS failures will happen, whether from provider outages, configuration errors, or automation bugs. The AWS outage reminds us that even the most sophisticated systems have single points of failure. Building DNS resilience requires:

  • Multiple resolution strategies (caching, fallbacks, operators)
  • Regular failure testing
  • Cross-provider redundancy for critical dependencies
  • Graceful degradation when DNS is unavailable

When DNS breaks, your applications shouldn’t. Plan for the inevitable, because in distributed systems, “it’s always DNS” until you’ve designed around the dependency.
