KBEC-00089 - Overcoming AGENT_TIMEOUT on slow or overloaded systems

Description

The CloudBees CD (CloudBees Flow) server can declare a timeout error (AGENT_TIMEOUT) before a step finishes, even though the step then runs to completion. The step log shows that the step completed normally, and can even contain start and end timestamps.

Typically, this occurs on a heavily loaded agent: the agent machine is so bogged down processing one or more commands, and possibly paging, that the CloudBees CD agent service cannot process messages from the CloudBees CD server. One way to alleviate the problem is to configure the server to be more tolerant of busy agents.

By default, every thirty seconds, the server sends a ping message to the agent to verify that the agent is still healthy. If several of these ping messages go unanswered, the server concludes that it has a dead agent and aborts any running steps (in the server’s record of job step status). Because the CloudBees CD server cannot communicate with the CloudBees CD agent, it does not send an abort message to the agent, and the commands currently executing on the agent continue to run to completion.

Solution

Increase the CloudBees CD server’s setting for the agent-ping timeout. This is a single server-side setting that affects all agents. For example, if an agent takes one minute to respond, set the timeout to 1.5 minutes instead of the default 0.5 minutes.

To set the agent response wait time, which is specified in milliseconds, to 1.5 minutes, use:

ectool --server <server> setProperty /server/settings/agentSocketTimeout 90000

The CloudBees CD server does not need to be restarted for the new value to take effect. Rerun the step to see if the agentSocketTimeout had the desired effect.
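
The arithmetic behind the 90000 value can be sketched as a small shell helper. This is only an illustration: the function names and the host name cd-server.example.com are hypothetical, and the 1.5x factor simply generalizes the one-minute example above rather than being a documented rule. The helper prints the ectool command instead of running it, so the value can be reviewed first.

```shell
# Hypothetical helper: suggest a timeout in milliseconds as 1.5x the
# slowest observed agent response time, given in whole seconds.
suggest_timeout_ms() {
  # seconds * 1.5 * 1000 == seconds * 1500 (integer arithmetic only)
  echo $(( $1 * 1500 ))
}

# Print (do not run) the ectool command that would apply the value.
# $1 = CloudBees CD server host name
# $2 = setting name under /server/settings
# $3 = observed agent response time in seconds
print_set_timeout_cmd() {
  echo "ectool --server $1 setProperty /server/settings/$2 $(suggest_timeout_ms "$3")"
}
```

For example, print_set_timeout_cmd cd-server.example.com agentSocketTimeout 60 prints a setProperty command with the value 90000, matching the command shown above.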

Bonus Solution

Overburdened agents may not respond to a request for a new connection quickly either.

The agentConnectionTimeout property is the time the CloudBees CD server waits before a connection request times out. This problem occurs less often, because the slow response is usually caused by agent step activity initiated over an existing connection.

To set the wait for socket creation, which is in milliseconds, to 1.5 minutes, use:


ectool --server <server> setProperty /server/settings/agentConnectionTimeout 90000

The CloudBees CD server does not need to be restarted for the new value to take effect. Rerun the step to see if the agentConnectionTimeout had the desired effect.
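
Before rerunning the step, it may help to read both settings back and confirm what the server currently holds. The following is a minimal sketch, assuming an ectool session that is already logged in. The function name show_agent_timeouts and the ECTOOL override variable are hypothetical conveniences, not part of the product.

```shell
# ECTOOL can be overridden (for example with a stub) for dry runs.
ECTOOL="${ECTOOL:-ectool}"

# Print the current value of both agent timeout settings.
# $1 = CloudBees CD server host name
show_agent_timeouts() {
  for name in agentSocketTimeout agentConnectionTimeout; do
    # Label each value with the setting it belongs to.
    printf '%s=' "$name"
    "$ECTOOL" --server "$1" getProperty "/server/settings/$name"
  done
}
```

Call it as show_agent_timeouts followed by the server host name. If a setting has never been set, getProperty may report an error for it instead of a value.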

Comments

  • Suresh Venkatesan

    In addition to increasing those timeouts, check whether the agent is hitting the “Too many open files” error, which prevents the Flow server from opening a socket to the agent (that is, an issue on the agent side). If it is, increase the open files limit for the agent as described in the Flow Install Guide (see also https://www.tecmint.com/increase-set-open-file-limits-in-linux/).
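
The open-files check suggested in the comment can be sketched as follows. This assumes a Linux agent machine with /proc available and that you have already found the agent's process ID (for example with ps). The function names are hypothetical, and how to raise the limit is left to the Install Guide.

```shell
# Count the file descriptors a process currently has open (Linux /proc only).
# $1 = process ID of the agent
open_fd_count() {
  ls "/proc/$1/fd" 2>/dev/null | wc -l | tr -d ' '
}

# Read the soft "Max open files" limit that applies to that process.
# $1 = process ID of the agent
open_fd_limit() {
  awk '/^Max open files/ {print $4}' "/proc/$1/limits"
}

# Percentage of the limit in use, rounded down.
# $1 = open descriptor count, $2 = numeric limit
fd_usage_percent() {
  echo $(( $1 * 100 / $2 ))
}
```

A usage percentage close to 100 for the agent process suggests that the limit, rather than the timeout, is what prevents new sockets from being opened. Note that open_fd_limit can return the word unlimited, in which case the percentage does not apply.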
