TCP Nagle's Delay
Someone recently asked me to help diagnose a mysterious delay on a globally deployed Elixir service.
Symptoms
- ~100 millisecond round trip downstream latency without TLS.
- ~20 millisecond round trip upstream latency.
- The delay could not be reproduced with other client types.
Explanation
The client configuration appeared correct, e.g. TCP_NODELAY was set. Further investigation revealed that the configuration parameters weren’t being honored. We managed to trace this to a client defect, which resulted in the following condition:
[Nagle’s] algorithm interacts badly with TCP delayed acknowledgments (delayed ACK), a feature introduced into TCP at roughly the same time in the early 1980s, but by a different group. With both algorithms enabled, applications that do two successive writes to a TCP connection, followed by a read that will not be fulfilled until after the data from the second write has reached the destination, experience a constant delay of up to 500 milliseconds, the “ACK delay”. It is recommended to disable either, although traditionally it’s easier to disable Nagle, since such a switch already exists for real-time applications.
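Since the defect here was a socket option that was requested but never applied, it can help to read the option back from the live socket instead of trusting the configuration. Below is a minimal sketch in Python (not the Elixir client from this incident); the helper names `nodelay_enabled` and `connect_with_nodelay` are made up for illustration.

```python
import socket


def nodelay_enabled(sock: socket.socket) -> bool:
    """Return True if TCP_NODELAY is actually set on this socket."""
    # getsockopt reads the value the kernel holds, not what the config claims.
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


def connect_with_nodelay(host: str, port: int) -> socket.socket:
    """Connect, disable Nagle's algorithm, and fail loudly if it did not stick."""
    sock = socket.create_connection((host, port))
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    if not nodelay_enabled(sock):
        sock.close()
        raise RuntimeError("TCP_NODELAY was requested but is not set")
    return sock
```

A check like this is cheap to run at connection setup or in a health check, and it would likely have surfaced a silently ignored option much earlier.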
[Figure: with delayed acknowledgments and Nagle’s algorithm]
[Figure: with delayed acknowledgments but without Nagle’s algorithm]
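To see the two cases side by side, a rough way is to time the write-write-read pattern on a local connection with Nagle's algorithm on and off. The sketch below is Python, assumes a loopback server that replies only once both small writes have arrived, and uses a hypothetical `measure` function. Actual numbers depend heavily on the operating system and its delayed ACK timer, so treat it as an experiment, not a benchmark.

```python
import socket
import threading
import time


def _reply_after_two_bytes(listener: socket.socket, rounds: int) -> None:
    """Accept one client and answer each round only after both bytes arrive."""
    conn, _ = listener.accept()
    with conn:
        for _ in range(rounds):
            data = b""
            while len(data) < 2:
                chunk = conn.recv(2 - len(data))
                if not chunk:
                    return
                data += chunk
            conn.sendall(b"k")


def measure(nodelay: bool, rounds: int = 20) -> float:
    """Return the mean seconds per write-write-read round trip."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    server = threading.Thread(
        target=_reply_after_two_bytes, args=(listener, rounds), daemon=True
    )
    server.start()

    client = socket.create_connection(listener.getsockname())
    client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1 if nodelay else 0)

    start = time.perf_counter()
    for _ in range(rounds):
        client.sendall(b"h")  # first small write goes out immediately
        client.sendall(b"b")  # with Nagle on, this may wait for the ACK of the first
        client.recv(1)        # the reply only comes after the second write lands
    elapsed = time.perf_counter() - start

    client.close()
    server.join()
    listener.close()
    return elapsed / rounds


if __name__ == "__main__":
    print(f"Nagle enabled:  {measure(nodelay=False) * 1000:.2f} ms per round")
    print(f"Nagle disabled: {measure(nodelay=True) * 1000:.2f} ms per round")
```

Combining the two small writes into a single `sendall` call is another way to avoid the pattern, since there is then no second write left waiting on an ACK.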
N.B. I have previously helped facilitate SRE training for this scenario.