The Flow server can declare a timeout error (AGENT_TIMEOUT) for a step before it finishes, even though the step goes on to complete. The step log shows that the step completed normally, and it can even contain start and end timestamps.
Typically, this occurs on a heavily loaded agent: the agent machine is so bogged down processing one or more commands, and possibly paging, that the Flow agent service cannot process messages from the Flow server. One way to alleviate the problem is to configure the server to be more tolerant of busy agents.
By default, every thirty seconds, the server sends a ping message to the agent to verify that the agent is still healthy. If several of these ping messages go unanswered, the server concludes that it has a dead agent and aborts any running steps (in the server’s record of job step status). Because the Flow server cannot communicate with the Flow agent, it does not send an abort message to the agent, and the commands currently executing on the agent continue to run to completion.
To make the server more tolerant, increase its agent-ping timeout. This is a single server-side setting that affects all agents. For example, if an agent takes one minute to respond, set a 1.5 minute timeout instead of the default 0.5 minutes.
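The sizing in the example above (one minute observed, 1.5 minutes configured) amounts to adding 50% headroom and converting to milliseconds, the unit the setting uses. A minimal sketch of that arithmetic follows; the function name timeout_ms_for and the 1.5x factor as a general rule are assumptions for illustration, not part of Flow.

```shell
#!/bin/sh
# Hypothetical helper (not part of Flow or ectool): convert an observed agent
# response time in whole seconds into a timeout in milliseconds with 50%
# headroom, mirroring the "one minute -> 1.5 minutes" example in the text.
timeout_ms_for() {
    observed_seconds=$1
    # seconds * 1000 (ms per second) * 1.5 (headroom) == seconds * 1500
    echo $(( observed_seconds * 1500 ))
}

timeout_ms_for 60   # prints 90000
```

Choose the observed response time from what you actually see on the busiest agent; the 1.5x factor is only a starting point.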
To set the agent response wait time to 1.5 minutes (the value is specified in milliseconds), use:
ectool --server <server-hostname> setProperty /server/settings/agentSocketTimeout 90000
The Flow server does not need to be restarted for the new value to take effect. Rerun the step to see if the agentSocketTimeout had the desired effect.
Overburdened agents may also be slow to respond to a request for a new connection.
The agentConnectionTimeout property is the time the Flow server waits before a connection request times out. This happens less often, because the slow response is usually caused by step activity on the agent that was initiated over an existing connection.
To set the socket creation wait time to 1.5 minutes (the value is specified in milliseconds), use:
ectool --server <server-hostname> setProperty /server/settings/agentConnectionTimeout 90000
The Flow server does not need to be restarted for the new value to take effect. Rerun the step to see if the agentConnectionTimeout had the desired effect.
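If you want to review both changes before applying them, a dry-run sketch like the following prints the two ectool commands without executing them. The function name print_timeout_commands and the host name flow-server.example.com are placeholders assumed for illustration; substitute your own server host name and timeout.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the ectool commands that set both
# agent timeouts. The function name and example host are hypothetical.
print_timeout_commands() {
    server=$1        # Flow server host name (placeholder in the call below)
    timeout_ms=$2    # timeout in milliseconds, e.g. 90000 for 1.5 minutes
    for setting in agentSocketTimeout agentConnectionTimeout; do
        echo "ectool --server $server setProperty /server/settings/$setting $timeout_ms"
    done
}

print_timeout_commands flow-server.example.com 90000
```

Printing the commands first makes it easy to confirm the property paths and the millisecond value before running them by hand.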