Hello! My name is Ilyas. In this article we will look at a well-known idea, keepalive in inter-service communication, which has saved more than one company in difficult times :). To make things more interesting, we will also look at the problems that modern technologies have introduced into keepalive (after all, what could go wrong with such a simple idea?). Specifically, we will examine the mechanisms that let you check the health of the connection between a client and a server in cases where regular TCP keepalives, due to the complexity of the architecture, cannot determine the state of the server.
Introduction
The moment you decide to cut your monolith into several parts (or start a new project with a microservice architecture), you accept a number of risks that come with this approach, and one of them is the possibility of a partial failure of your system. As practice shows, the story here is exactly the same as with database backups: sooner or later, most systems run into it. Picture the situation: your graphs show a pile of client cancellations or Internal Server Error along with Context Deadline Exceeded (at best), you wonder what is going on, and of course the colleague who quit a couple of months ago never set up alerts for instance failures of his service, so you spend a very long and painful time digging out the cause. Such situations have happened to us too. So this article is about the measures you can take in advance to avoid that pain and to handle the failure of any part of the system more accurately and gracefully.
But before we get to the most interesting parts, we need to recall why these keepalives are needed at all. Below the spoiler is a short primer on the topic.
A primer on keepalive
Reference documentation here.
Keepalive is a mechanism for determining whether someone or something on the other side of a connection is still alive. In our context we are discussing inter-service communication, so we have two machines that talk to each other over a network, and keepalive is how we find out whether the machine at the other end of the line is alive.
The concept is quite simple. Once a connection between the two machines is established, each of them starts a timer; when the timer fires, the machine sends a special message called a keepalive probe. On receiving such a probe, the peer replies with an acknowledgement (ACK) indicating that it is alive. Accordingly, when we receive an ACK in response to our keepalive probe, we conclude that the machine on the other side is alive. If, on the other hand, no ACK arrives within some time after we sent the probe, we conclude that the machine on the other side is no longer reachable. For reliability, we can send several such probes before drawing that conclusion, to lower the chance of a false positive. The timers and the number of probes are all configurable; you can read more about this in the reference documentation.
TCP keepalive problem
The most popular mechanism, proven over decades, is TCP keepalive, which operates at L4 of the OSI model. It works wonderfully when two nodes talk to each other directly. But in a modern architecture, with virtualization, containerization, proxies and other intermediate systems, it has a significant drawback: it cannot check the state of the end node when there is an intermediary in the path. If you make requests through an intermediary, the TCP keepalive configured on your side checks the state of that intermediary, and the intermediary in turn determines the state of the end node. In this case you have to configure TCP keepalive parameters in three places at once.
And since that intermediary is sometimes configured by someone else, your own settings cannot reliably determine the state of the end node. On Linux, by default, the first keepalive probe is sent after 2 hours of connection inactivity, followed by up to 9 probes every 75 seconds, and only then is the connection marked as down. In total, detecting an unreachable end node with default settings takes a little over 2 hours and 11 minutes, which is an extremely slow result for any more or less loaded system.
A little offtopic: among other things, specifically in Go we discovered a peculiarity in configuring L4 keepalive. By default, the net package sets TCP keepalive parameters that differ from the Linux settings, and to use the OS settings, net.Transport has to be configured differently. You can read more details here and here. A PR adding the ability to override TCPKeepAliveIdle, TCPKeepAliveInterval and TCPKeepAliveCount is here.
This is where keepalives at a higher level, namely L7, come to the rescue.
HTTP/2 ping
This mechanism is implemented in the HTTP/2 protocol in the form of the PING frame. It allows you to check the state of the end node even when there is an intermediary at the L4 level, effectively bypassing that intermediary.
Since this is an article feat. Go, let's talk about the implementation there.
The out-of-the-box configuration of these parameters lives here. There are two parameters, ReadIdleTimeout and PingTimeout. The first determines how long after we stop receiving any data from the server we will send a PING frame. PingTimeout is how long after sending a PING frame we will wait for the ACK before closing the connection. The source code of this mechanism can be viewed here.
// ReadIdleTimeout is the timeout after which a health check using ping
// frame will be carried out if no frame is received on the connection.
// Note that a ping response is considered a received frame, so if
// there is no other traffic on the connection, the health check will
// be performed every ReadIdleTimeout interval.
// If zero, no health check is performed.
ReadIdleTimeout time.Duration
// PingTimeout is the timeout after which the connection will be closed
// if a response to Ping is not received.
// Defaults to 15s.
PingTimeout time.Duration
And, it would seem, everything is fine, but the Go implementation does not provide a way to send several PING frames before the connection is closed. Whether that is good or bad is an open question; the assumption is that delivery guarantees are already provided at the L4 layer. At startup, the http2 package starts a time.AfterFunc that triggers the ping mechanism after ReadIdleTimeout. Further down in the code there is a comment: “We don’t need to periodically ping in the health check, because the readLoop of ClientConn will trigger the healthCheck again if there is no frame received.” Apparently this means that the healthCheck function itself does not need to run pings several times, since the readLoop above it does that for it (and it has its own time.AfterFunc 🙂).
In any case, correctly configuring this mechanism lets you send PING frames at the L7 level without worrying about how TCP keepalive is configured on the proxy through which the requests are made.
gRPC
This is also an article feat. gRPC-Go, so let's talk a little about it.
gRPC-Go has its own implementation of an HTTP/2 client, and therefore its PING mechanism has a number of peculiarities (apparently because the gRPC implementation appeared several months earlier than the one in Go). On the end-node side there are additional settings that can limit the number of pings from the client; these settings are called the enforcement policy.
type EnforcementPolicy struct {
	// MinTime is the minimum amount of time a client should wait before sending
	// a keepalive ping.
	MinTime time.Duration // The current default value is 5 minutes.
	// If true, server allows keepalive pings even when there are no active
	// streams(RPCs). If false, and client sends ping when there are no active
	// streams, server will send GOAWAY and close the connection.
	PermitWithoutStream bool // false by default.
}
These settings define two parameters: the minimum ping interval and whether pings are permitted in the absence of active requests. If the client violates them, the gRPC library on the end-node side may close the connection with a too_many_pings error. So, for keepalive to work correctly, the client and server settings must be coordinated.
There is a detailed description under the spoiler.
MinTime limits how often a client may ping. If the client starts pinging more often than this, the server will close the connection with a too_many_pings error after 3 such pings. PermitWithoutStream controls whether pings are allowed when the client has no requests in flight to this backend. If you send only a small number of requests to the backend, you will often find yourself in a situation where a ping fires while no requests are in progress. So, if it is important to you that such requests are handled quickly, without extra retries and delays, you should allow keepalive pings without concurrent requests. The system will then constantly poll the backend, and when the time comes to make a request, it will send it to a backend that is known to be healthy, without any ad-hoc checks beforehand.
BUT! In gRPC-Go, if the connection is closed with the too_many_pings error, the client part of the library will reopen the connection, but will send PING frames half as often as it did before the error. As a result, sooner or later the system automatically reconciles the client and backend settings and starts sending PING frames in accordance with the enforcement policy.
Conclusions
As for the technical part of the article: the PING frame in HTTP/2 is an extremely useful mechanism, but since there are no strict requirements for its configuration, it can be implemented differently in different systems. If there is an intermediary in your connection path and you do not know how L4 TCP keepalive is configured on it (and you are not the one in charge of configuring it), it makes sense to look towards HTTP/2 pings.
As for the business part: set up keepalive at all levels of your systems and don't lose traffic. Believe me, losing it is very unpleasant 🙂