
Your Microservices Latency Nightmare Is Probably Just TCP_NODELAY

Why disabling Nagle’s algorithm has become the default debugging ritual for distributed systems builders, and what it reveals about modern architecture trade-offs.

by Andre Banandre

If you’ve spent more than fifteen minutes debugging mysterious latency spikes in a distributed system, someone’s already suggested it. If you’ve spent more than two hours, you’ve probably tried it. And if you’ve spent a full day chasing p99 delays that make no sense, you already know the punchline: TCP_NODELAY.

Marc Brooker nailed it in his post that ricocheted around Hacker News: "It’s always TCP_NODELAY. Every damn time." This isn’t a coincidence, it’s a systemic failure of default configurations meeting modern architecture realities. The fact that every distributed systems builder has lost hours to this one socket option reveals something deeper about how poorly our tools match our current environment.

The 1980s Algorithm That Won’t Die

Nagle’s algorithm made perfect sense in 1984. John Nagle’s RFC896 described a world where single-byte keyboard inputs were getting wrapped in 40-byte TCP headers, an obnoxious 4000% overhead. His solution was elegant: full-sized segments go out immediately, but a new small packet is held back while previously sent data is still unacknowledged. This simple rule forced small messages to coalesce, amortizing header costs and preventing network congestion from a million tiny packets.
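
The rule itself fits in a few lines. Here’s a rough Python-flavored sketch of the sender-side decision (illustrative only, not actual kernel code):

def nagle_allows_send(segment_len, mss, unacked_bytes):
    # Illustrative restatement of Nagle's rule, not real kernel logic.
    if segment_len >= mss:
        return True       # full-sized segments always go out immediately
    if unacked_bytes == 0:
        return True       # nothing in flight, so a small segment is fine too
    return False          # small segment while data is unACKed: buffer and wait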

The logic was sound for telnet sessions and serial lines. But here’s where it gets spicy: Nagle’s algorithm never used timers. It purely depends on round-trip time (RTT). When your RTT is measured in seconds over dial-up modems, waiting for an ACK is reasonable. When your RTT is 500 microseconds inside a modern datacenter, you’re still waiting. And modern servers can execute hundreds of thousands of operations in that half-millisecond.

The Delayed ACK Trap

The real villain isn’t Nagle alone, it’s his toxic relationship with delayed acknowledgments. Delayed ACKs were designed to piggyback acknowledgments on response data, avoiding empty ACK packets. RFC813 from 1982 suggested delaying ACKs "until there’s some data to send back", which RFC1122 later formalized with a timer (capped at 500ms, and commonly around 40ms on Linux).

Combine these two reasonable ideas and you get a deadlock:

  • Nagle’s algorithm: "I won’t send more data until I get an ACK"
  • Delayed ACK: "I won’t send an ACK until I have data to send"

In a request-response pattern, this interaction adds a full RTT of latency for no reason. Nagle himself complained about this on Hacker News: "That still irks me. The real problem is not tinygram prevention. It’s ACK delays, and that stupid fixed timer. They both went into TCP around the same time, but independently. I did tinygram prevention (the Nagle algorithm) and Berkeley did delayed ACKs, both in the early 1980s. The combination of the two is awful."

This is systems design 101: two locally optimal features creating globally terrible behavior. And it’s still biting us forty years later.
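
You can reproduce the interaction on a laptop. The sketch below uses plain Python sockets on loopback and runs the classic write-write-read pattern that trips the deadlock, once with Nagle left on and once with TCP_NODELAY. It assumes a Linux-style delayed-ACK implementation; exact numbers vary by kernel, and on some setups the stall may not appear at all:

# Self-contained repro of the write-write-read stall (Python 3.8+, loopback only).
import socket
import threading
import time

HOST, REQUEST_SIZE = "127.0.0.1", 2

def server(listener):
    # Reply "ok" only once the full 2-byte "request" has arrived.
    conn, _ = listener.accept()
    with conn:
        while True:
            data = b""
            while len(data) < REQUEST_SIZE:
                chunk = conn.recv(1024)
                if not chunk:
                    return
                data += chunk
            conn.sendall(b"ok")

def worst_round_trip_ms(nodelay, rounds=20):
    listener = socket.create_server((HOST, 0))
    threading.Thread(target=server, args=(listener,), daemon=True).start()
    worst = 0.0
    with socket.create_connection((HOST, listener.getsockname()[1])) as c:
        if nodelay:
            c.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(rounds):
            start = time.perf_counter()
            c.sendall(b"a")   # first half goes out immediately (nothing unACKed)
            c.sendall(b"b")   # second half: Nagle holds this small write until the
                              # first is ACKed, and delayed ACK is in no hurry to ACK it
            c.recv(16)
            worst = max(worst, (time.perf_counter() - start) * 1000)
    listener.close()
    return worst

if __name__ == "__main__":
    print(f"worst round trip, Nagle on : {worst_round_trip_ms(False):6.2f} ms")
    print(f"worst round trip, Nagle off: {worst_round_trip_ms(True):6.2f} ms")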

Why Microservices Make This Worse

The microservices explosion didn’t create this problem, but it weaponized it. As Derek Comartin points out in his analysis of microservice coupling, introducing network boundaries exposes problems that were always there. In a monolith, you might not notice a few milliseconds of delay between components. But when service A calls service B calls service C, and each hop has a potential Nagle/delayed ACK interaction, you’re not adding latency, you’re multiplying it.

Consider a typical three-service request flow in a Kubernetes cluster:

  1. API gateway receives client request
  2. Calls auth service (potential Nagle delay)
  3. Calls user service (potential Nagle delay)
  4. Calls billing service (potential Nagle delay)

Each hop inside a datacenter adds ~500μs of RTT. When the Nagle/delayed ACK stall kicks in, you can see an extra millisecond or two per hop. Three services later, you’ve added 3-6ms to your p99 for absolutely no benefit. In high-throughput systems, this compounds into observable tail latency that violates SLOs and drives engineers to drink.
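
Back-of-the-envelope, treating those figures as rough assumptions rather than measurements:

# Rough tail-latency math for the three-hop flow above (numbers are illustrative)
hops = 3                        # auth, user, billing
base_rtt_ms = 0.5               # ~500 microseconds per in-datacenter hop
extra_stall_ms = (1.0, 2.0)     # assumed extra delay when Nagle and delayed ACK collide

clean_ms = hops * base_rtt_ms
nagle_tax_ms = (hops * extra_stall_ms[0], hops * extra_stall_ms[1])
print(f"clean path ~{clean_ms:.1f} ms, Nagle tax +{nagle_tax_ms[0]:.0f} to +{nagle_tax_ms[1]:.0f} ms")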

The Application Layer Already Solved This

Here’s the kicker: Nagle’s algorithm is solving a problem that doesn’t exist in modern distributed systems. The original justification was preventing 41-byte packets from inefficient apps doing byte-by-byte writes. But modern distributed systems don’t work that way:

  • Serialization overhead: JSON, protobuf, or Avro encoding means you’re never sending single bytes
  • RPC framing: gRPC and other RPC frameworks write complete, framed messages at the application layer
  • TLS overhead: The encryption layer adds its own framing and buffering
  • Batching is explicit: Message queues and streaming systems intentionally coalesce messages

Applications that care about throughput already batch. Applications that care about latency already manage their buffers. The kernel’s attempt to be helpful is just getting in the way.
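
As an example of what "managing their buffers" looks like, a latency-sensitive client typically assembles the whole serialized message in userspace and hands it to the kernel in a single write, leaving Nagle nothing to coalesce (a minimal sketch; the length-prefixed framing here is made up for illustration):

import json
import socket
import struct

def send_request(sock, payload):
    # Application-level batching: serialize the whole message, length-prefix it,
    # and push it to the kernel in one write instead of dribbling out small chunks.
    body = json.dumps(payload).encode()
    frame = struct.pack("!I", len(body)) + body   # 4-byte length prefix (illustrative framing)
    sock.sendall(frame)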

The Data Behind the Default

Let’s talk numbers. Brooker’s analysis highlights some stark realities:

  • In-datacenter RTT: ~500 microseconds
  • Cross-region RTT: 2-50 milliseconds
  • Modern server capacity: Millions of operations per millisecond

The performance delta is measurable. In low-latency trading systems, enabling TCP_NODELAY can shave off 50-100 microseconds per message. For a system doing 100,000 messages per second, that’s the difference between meeting your latency budget and explaining to the business why you missed it.

AWS’s own latency optimization guidance for real-time applications implicitly acknowledges this. While they focus on infrastructure solutions like Global Accelerator and Direct Connect, the underlying assumption is that your application isn’t adding unnecessary delays. Nagle’s algorithm is the definition of an unnecessary delay.

When to Actually Worry

Brooker’s take is refreshingly direct: "if you’re building a latency-sensitive distributed system running on modern datacenter-class hardware, enable TCP_NODELAY (disable Nagle’s algorithm) without worries." No caveats, no hand-wringing. Just do it.

But when should you not disable it? Almost never. The original use case, terminal sessions with human typing, has been replaced by SSH with its own buffering. The only scenario where you might want Nagle’s algorithm is if you’re dealing with truly pathological application code that does single-byte writes with no buffering. And in that case, you should fix the application, not rely on a kernel-level bandaid from the Reagan administration.

The Bigger Picture: Defaults Matter

This isn’t just about one socket option. It’s about a pattern: our industry’s failure to evolve defaults as contexts change. Nagle’s algorithm made sense for 1980s networks. It makes zero sense for 2020s distributed systems. Yet it’s still the default in every major OS.

This reveals a deeper architectural principle: defaults are architecture. When we accept defaults without questioning them, we’re making design decisions by omission. Most engineers don’t even know Nagle’s algorithm exists until it bites them. They certainly don’t know about its dysfunctional relationship with delayed ACKs.

The real controversy isn’t whether to disable Nagle’s algorithm, every experienced distributed systems engineer already does. The controversy is why we still have to. Why isn’t TCP_NODELAY the default for server sockets? Why do we force every new generation of engineers to rediscover this footgun?

Practical Actions

Stop what you’re doing and check your services:

# Check whether TCP_NODELAY is being set (this is harder than it should be:
# ss won't show the option, so one approach is to watch setsockopt calls with
# strace; note it only catches calls made while you're attached)
sudo strace -f -e trace=setsockopt -p <PID> 2>&1 | grep TCP_NODELAY

# In your code, set it explicitly
# Python
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Go (net.TCPConn already enables it by default, but being explicit documents intent)
conn.(*net.TCPConn).SetNoDelay(true)

# Java
socket.setTcpNoDelay(true);

Then add it to your service templates, your framework defaults, your code review checklist. Make it the default for every new service, and backport it to existing ones during the next maintenance window.
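
One cheap way to make that stick is to set the option wherever connections are created, not in each handler. A sketch using Python’s stdlib (adapt to whatever framework or template system you actually use):

import socket

def accept_with_nodelay(listener):
    # Make TCP_NODELAY the default for every inbound connection so individual
    # handlers never have to remember it.
    conn, addr = listener.accept()
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return conn, addr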

While you’re at it, audit your other TCP settings. TCP_QUICKACK exists, but it’s Linux-specific and has weird semantics: it isn’t a set-and-forget option, the kernel can quietly fall back into delayed-ACK mode on its own. As Brooker notes, it doesn’t fix the fundamental problem: the kernel shouldn’t hold your data hostage.
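
Concretely, that re-arming dance looks roughly like this (Linux-only sketch; socket.TCP_QUICKACK isn’t defined on every platform):

import socket

def recv_with_quickack(conn, nbytes):
    # TCP_QUICKACK is not sticky: the kernel can clear it and return to delayed
    # ACKs on its own, so re-enable it before each receive.
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)
    return conn.recv(nbytes)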

The Takeaway

The next time someone suggests "maybe it’s TCP_NODELAY", resist the urge to roll your eyes. They’re probably right. But also ask: why are we still having this conversation in 2025?

The algorithm that saved 1980s networks from congestion is now causing congestion in our debugging workflows. It’s time to recognize that the environment has changed, the assumptions are invalid, and the default is wrong. TCP_NODELAY shouldn’t be a debugging tip passed around in Slack channels, it should be the default for any networked application built after 2010.

Until that happens, keep spreading the gospel: it’s always TCP_NODELAY. Every damn time.