This seems to be caused by cluster nodes running out of so-called "ephemeral" TCP/IP ports. A quick TCP/IP 101: every time a TCP/IP connection is established between two systems, a port number on each system is allocated for the connection. If either side does not explicitly specify a port number, the OS assigns one dynamically from a pool of port numbers. These are called ephemeral ports because they are intended for use in transient TCP/IP connections. Normally the client side of a TCP/IP connection uses an ephemeral port, and the server side uses a well-known port number (such as port 80 for web servers). By default, Windows is configured to use a pool of about 4,000 port numbers for ephemeral ports.
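As a rough illustration of the client/server port split described above, the following Python sketch opens a loopback connection and reports which port the OS picked for the client side. The function name and the use of port 0 for the listener are assumptions made for this example only; a real server would listen on its well-known port.

```python
import socket


def ephemeral_port_demo():
    """Open a loopback TCP connection and return (server_port, client_port, peer_port)."""
    # Listener: binding to port 0 asks the OS for any free port. This stands in
    # for a well-known server port so the example can run on any machine.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))
        server.listen(1)
        server_port = server.getsockname()[1]

        # Client: no local port is specified, so the OS assigns one dynamically
        # from its ephemeral port pool.
        with socket.create_connection(("127.0.0.1", server_port)) as client:
            client_port = client.getsockname()[1]

            # The server sees the client's ephemeral port as the peer port.
            conn, peer = server.accept()
            conn.close()

    return server_port, client_port, peer[1]


if __name__ == "__main__":
    server_port, client_port, peer_port = ephemeral_port_demo()
    print("server listens on port", server_port)
    print("client was assigned ephemeral port", client_port)
```

Each call consumes one port from the pool on the client side, which is the resource the rest of this article is concerned with.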
Due to the way TCP/IP works, the port used for a connection is not released for a certain amount of time after the application using it closes its connection. This is required by TCP/IP in order to guarantee reliability of data transfers. The TCP/IP term for this state is TIME_WAIT, and a port typically stays in this state for between 60 and 240 seconds. As long as a port is in the TIME_WAIT state, it cannot be used for a new connection.
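To get a feel for what these numbers mean in practice, the back-of-the-envelope calculation below estimates how many short-lived connections per second a node can sustain before every port in a pool of about 4,000 is sitting in TIME_WAIT. It is a simplified sketch: it assumes every connection takes a local ephemeral port, that the port then waits out the full TIME_WAIT period, and that the time the connection is actually open is negligible. The function name is made up for this example.

```python
def max_sustained_rate(pool_size, time_wait_seconds):
    """Rough upper bound on connections per second before the pool runs dry.

    In steady state, ports held = rate * time_wait_seconds, so the rate that
    exactly fills the pool is pool_size / time_wait_seconds.
    """
    return pool_size / time_wait_seconds


if __name__ == "__main__":
    pool_size = 4000  # approximate default pool size quoted above
    for time_wait_seconds in (60, 240):
        rate = max_sustained_rate(pool_size, time_wait_seconds)
        print(f"TIME_WAIT of {time_wait_seconds}s: about {rate:.1f} connections/second")
```

Under these assumptions the limit works out to roughly 67 connections per second with a 60-second TIME_WAIT and roughly 17 per second with a 240-second TIME_WAIT, which a busy application can easily exceed.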
Given that Windows has such a small range of port numbers, and that those ports can get stuck in the TIME_WAIT state for so long, it's easy to imagine a scenario in which all ephemeral ports become unavailable due to an application rapidly opening, using, and closing TCP/IP connections. Unfortunately, this is precisely how the agent's agent-to-agent connections are used. After the port range is exhausted, further connections to or from the system are rejected until one of the ephemeral ports becomes available. You can confirm that this is in fact the problem by running "netstat -a" on the node when you see these errors. If this is the problem, you will see hundreds or thousands of sockets listed in the TIME_WAIT state.
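If scanning the netstat listing by eye is impractical, the count can be automated. The Python sketch below is one possible way to do it, not a tool shipped with the product: both helper names are hypothetical, and the "-n" flag is added here only to skip name lookups so the command returns faster.

```python
import subprocess


def count_time_wait(netstat_output):
    """Count the lines of netstat output that report a socket in TIME_WAIT."""
    count = 0
    for line in netstat_output.splitlines():
        # Split on whitespace and look for TIME_WAIT as a separate column.
        if "TIME_WAIT" in line.split():
            count += 1
    return count


def count_time_wait_live():
    """Run netstat on the local node and count its TIME_WAIT sockets."""
    result = subprocess.run(
        ["netstat", "-a", "-n"],
        capture_output=True,
        text=True,
        check=True,
    )
    return count_time_wait(result.stdout)
```

Calling count_time_wait_live() on an affected node while the errors are occurring should return a number in the hundreds or thousands if port exhaustion is the cause.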
A redesign of this part of the software that avoids the problem entirely is in progress, but until it is complete, use this workaround: reconfigure Windows to make more port numbers available for use as ephemeral ports. This is what Microsoft recommends for its own port-hungry servers, such as MS Exchange and MS SQL Server. A knowledge base article that describes another manifestation of the same problem, along with instructions for reconfiguring Windows, is available at:
Our current installer should already set this value to a high number.
If you still need to change the value and wish to avoid logging in to each system to change the registry, use the cmtool utility:
Where "NNNNNNN" is the hostname or IP address of the Cluster Manager.
This changes the upper end of the ephemeral port range from the default 5000 to the Microsoft-recommended 60000. You must reboot the nodes after making this change.
- Product versions: All
- OS versions: Windows