This page does not describe an existing protocol; it outlines a potential future way for PSYC, or any other application, to ensure that its TCP connections are alive and well. Interestingly, since once more the problem seems best approached by multicast tree distribution, PSYC could itself be part of the solution.
Keep Alive Protocol
Objective
Many applications on today's Internet employ homegrown ping/pong strategies to make sure they are still connected to their respective servers or peers.
The objective of a keep-alive protocol would be that an application can be completely sure it will be informed as soon as possible of a network dropout anywhere along the route of its TCP connections, without having to do a single bit of overhead messaging itself.
Motivation
There was a time when the Internet had no firewalls and no network address translators, and a broken TCP connection would trigger an ICMP message informing applications that a link had broken or a remote host had become unreachable. Only when a power outage occurred or a cable was physically disconnected would information about the event fail to reach the affected applications automatically. This was not a frequent case, but it was nonetheless addressed with the SO_KEEPALIVE option for TCP sockets.
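For reference, this is what that classic mechanism looks like at the sockets API level. A minimal sketch; the fine-tuning options are Linux-specific and the timing values are arbitrary:

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Enable classic TCP keep-alive on a connected socket. The tuning
 * options below are Linux-specific; the values are arbitrary. */
int enable_so_keepalive(int fd)
{
    int on = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0)
        return -1;

#ifdef TCP_KEEPIDLE
    /* Start probing after 60 s of idleness, probe every 10 s,
     * declare the connection dead after 5 unanswered probes.   */
    int idle = 60, intvl = 10, cnt = 5;
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,  &idle,  sizeof idle);
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof intvl);
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,   &cnt,   sizeof cnt);
#endif
    return 0;
}
```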
This is still the behaviour of TCP in the core backbones of the Internet today, and power outages are nowadays taken care of as well. But on the way to the end user, TCP connections are no longer very safe.
IRC may have been the first protocol to employ a custom ping strategy, keeping automatic modem dial-up systems from going into sleep mode. SO_KEEPALIVE was so well established that you could keep TCP connections up while the ISDN or telephone line was actually shut down: any incoming packet would reconnect the line, and in the meantime the endpoints would issue fake keep-alive acknowledgments, as if in a conspiracy with the other end of the telephone line. That was in the late 90s.
Newer technologies have made this approach impossible. Nowadays every protocol runs its own strategy to ensure the line is still functioning; dial-up is hardly a viable strategy for true membership in the Internet anyway. SO_KEEPALIVE has become so unreliable that only a ping and the subsequent reception of a pong tells you that you are still connected. And to be sure, you even have to do it twice, from both ends of the TCP connection.
Since some router hardware considers idle TCP connections abnormal (web surfing being the norm), applications have to keep their TCP connections artificially busy, down to at least once a minute. Time-critical applications may require even more frequent checks.
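To illustrate the kind of code every such application ends up carrying, here is a rough sketch of a typical homegrown ping loop. The PING/PONG wire format and the one-minute interval are placeholders, not taken from any particular protocol:

```c
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>

/* Hypothetical application-level keep-alive loop: make sure at least
 * one packet crosses the connection every minute, and presume the
 * connection dead if a probe stays unanswered for another minute. */
int keepalive_loop(int fd)
{
    char buf[512];
    int awaiting_pong = 0;

    for (;;) {
        fd_set rfds;
        struct timeval tv = { .tv_sec = 60, .tv_usec = 0 };

        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);

        int ready = select(fd + 1, &rfds, NULL, NULL, &tv);
        if (ready < 0)
            return -1;                       /* select() failed        */

        if (ready == 0) {                    /* a minute of silence    */
            if (awaiting_pong)
                return -1;                   /* probe unanswered: dead */
            if (send(fd, "PING\r\n", 6, 0) != 6)
                return -1;
            awaiting_pong = 1;
            continue;
        }

        ssize_t n = recv(fd, buf, sizeof buf - 1, 0);
        if (n <= 0)
            return -1;                       /* peer closed or error   */
        awaiting_pong = 0;                   /* any traffic proves life */
        buf[n] = '\0';
        if (strncmp(buf, "PING", 4) == 0)
            send(fd, "PONG\r\n", 6, 0);      /* answer the peer's probe */
        /* ... pass other data on to the application ... */
    }
}
```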
This huge mass of home-brewed ping strategies, in each application for each user, is bound to cause a considerable amount of overhead traffic on the Internet. It would be a good idea to turn attention to it and come up with a more efficient solution to the problem.
Idea
So here is the concept for a protocol that brings the idea of keep-alive up to date. Let's call it KAP.
Each application using KAP would simply turn off its own pinging and use a KAP API to formulate its network reliability requirements to the KAP backend. Probably a flag like SO_KEEPALIVE would be sufficient, really.
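Purely as an illustration of how small that API could be, here is a sketch in the spirit of SO_KEEPALIVE. The SO_KAP option is entirely hypothetical; no such option exists in any kernel today:

```c
#include <sys/socket.h>

/* HYPOTHETICAL: no SO_KAP option exists anywhere today. The point of
 * this sketch is merely that, from the application's perspective,
 * handing liveness monitoring over to a KAP backend could be a single
 * flag, analogous to SO_KEEPALIVE. */
#define SO_KAP 0x7001            /* made-up option number */

static int request_keep_alive_monitoring(int fd)
{
    int on = 1;
    return setsockopt(fd, SOL_SOCKET, SO_KAP, &on, sizeof on);
}
```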
A host-wide KAP daemon would know which other hosts the applications (or the system kernel as a whole) are keeping TCP connections to. It would tell the network gateway/router to maintain stable links to those hosts (actually, these could be implicit in the mere fact that TCP links have been created), and it would use KAP pings itself to ensure the gateway is not going away. Thus, only one ping strategy would be used between the host and the router.
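One conceivable way for such a daemon to learn which connections exist, sketched here for Linux only: scan /proc/net/tcp for established sessions and request monitoring for each remote host. The report_to_gateway() function is a made-up placeholder for the actual KAP request towards the router:

```c
#include <stdio.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/* Sketch only, Linux-specific: find the remote hosts this machine
 * currently keeps TCP connections to by scanning /proc/net/tcp for
 * ESTABLISHED sessions. report_to_gateway() stands in for the (so far
 * imaginary) KAP request towards the local router. */
static void report_to_gateway(const char *remote_ip)
{
    printf("KAP: please watch reachability of %s\n", remote_ip);
}

int kap_scan_established(void)
{
    FILE *f = fopen("/proc/net/tcp", "r");
    if (!f)
        return -1;

    char line[512];
    fgets(line, sizeof line, f);              /* skip the header line */

    while (fgets(line, sizeof line, f)) {
        unsigned rem_addr, state;
        /* fields: sl local_address rem_address st ... (hex values)  */
        if (sscanf(line, "%*d: %*x:%*x %x:%*x %x", &rem_addr, &state) != 2)
            continue;
        if (state != 0x01)                    /* 0x01 == ESTABLISHED  */
            continue;

        /* /proc/net/tcp prints the raw 32-bit address, so it can be
         * assigned to in_addr directly, regardless of endianness.   */
        struct in_addr a = { .s_addr = rem_addr };
        char ip[INET_ADDRSTRLEN];
        inet_ntop(AF_INET, &a, ip, sizeof ip);
        report_to_gateway(ip);
    }
    fclose(f);
    return 0;
}
```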
The router, in turn, may be connected to the Internet via a DSL link. DSL already has a built-in capability to detect connection loss, so it would be sufficient for the router to tell its DSL counterpart to ensure the reliability of all TCP links that have been established over this DSL session. A simple flag in the PPP negotiation could be enough to achieve that, or it could become the future default behaviour.
The WAN is stateless
The other side of such a DSL connection is a wide-area network router. It is itself communicating with many routers, carrying TCP connections to certain hosts over a number of router hops. Since these routers typically keep no state about the TCP sessions they are carrying, they have only a vague notion of which neighbors they are currently required to keep a reliable connection to. Should any intermediate router disappear, the next router in line may detect this, but it would not know which routers to inform about the loss.
Needing to know whether a remote host is available is essentially a publish/subscribe problem, well suited to multicast distribution trees to ensure scalability: each notification should only travel the network once.
Each of the circuits between routing nodes needs to be monitored for health. In the sane backbones of the Internet a low checking frequency would be sufficient, while a high one is necessary across nasty borders. Should a circuit break down, all subscriptions depending on that circuit can be informed with an error message.
This means there would be no end-to-end ping/pong anymore. When an application intends to stay connected to a host, its KAP daemon subscribes to that host's keep-alive context. Such contexts would be configured without the centralized enter/leave access control that is the default management mode for PSYC contexts, allowing the closest router on the multicast tree to quietly handle the job on behalf of the KAP daemon, without burdening the monitored host with keeping a complete list of subscribers or even having to acknowledge them.
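A rough sketch of what a KAP-aware node on such a multicast tree might do; every structure and function name here is invented for illustration. Downstream subscriptions are aggregated so that only one subscription per monitored host travels upstream, and when the circuit towards that host breaks, the bad news is fanned out once to the local subscribers:

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative only: one node on the keep-alive multicast tree.
 * "host" is the monitored remote host, "subscribers" are the
 * downstream neighbours (or local KAP daemons) interested in it. */

#define MAX_SUBS 64

struct kap_context {
    char host[64];                    /* monitored host                */
    char subscribers[MAX_SUBS][64];   /* downstream interested parties */
    int  nsubs;
    struct kap_context *next;
};

static struct kap_context *contexts;

/* Placeholders for the actual tree I/O. */
static void forward_subscribe_upstream(const char *host) { (void)host; }
static void send_error_downstream(const char *sub, const char *host)
{ (void)sub; (void)host; }

/* A downstream neighbour subscribes to `host`'s keep-alive context.
 * No access control: the closest tree node handles it quietly, and
 * only the first subscription is forwarded towards the host. */
void kap_subscribe(const char *subscriber, const char *host)
{
    struct kap_context *c;
    for (c = contexts; c; c = c->next)
        if (strcmp(c->host, host) == 0)
            break;

    if (!c) {
        c = calloc(1, sizeof *c);
        strncpy(c->host, host, sizeof c->host - 1);
        c->next = contexts;
        contexts = c;
        forward_subscribe_upstream(host);  /* one subscription per host */
    }
    if (c->nsubs < MAX_SUBS)
        strncpy(c->subscribers[c->nsubs++], subscriber,
                sizeof c->subscribers[0] - 1);
}

/* The circuit towards `host` broke (or an error arrived from upstream):
 * fan the bad news out once to every local subscriber. */
void kap_circuit_down(const char *host)
{
    for (struct kap_context *c = contexts; c; c = c->next)
        if (strcmp(c->host, host) == 0)
            for (int i = 0; i < c->nsubs; i++)
                send_error_downstream(c->subscribers[i], host);
}
```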
Interestingly, this scenario could be implemented using a protocol like PSYC.