Why do slaves show as suspended while jobs wait in the queue?

Issue

  • Operations Center agents show as suspended while jobs wait in the client master queue.
  • Why are my agents showing up as suspended?
  • How do I switch to a different retention strategy for my JOC agents?

Environment

  • CloudBees Jenkins Operations Center (JOC)
  • ‘Shared Agent’ and/or ‘Shared Cloud’
  • ‘Client masters’ (CMs) and/or ‘Managed Masters’ (MMs)

Resolution

Starting points

In the environment described, CMs share agents, which means that they compete for the same agents.

The pattern seen is that the JOC agent(s) are picked up by the first CM, and that CM queues jobs at a rate high enough that the agent(s) are always in use.

Once a JOC agent serves its executors to a CM, the rest of the CMs/MMs do not have access to that JOC agent until it is released. While it is serving a CM, a JOC agent can use each of its executors just once for that CM. Once the JOC agent goes back under JOC control, it recovers all its defined executors and is ready for the next (or the same) CM.

Each CM has its own queue and JOC does not know the size of the queue that it is being requested to provision against. Most critically of all, each CM is unaware of the needs of the other CMs/MMs.

There are a number of different factors that come into play:

  • The number of executors per agent
  • The rate of jobs arriving
  • The length of time that each job takes to complete
  • The strategy for retaining agents on a master

Strategies

Four different strategies have been tested against a variety of conditions.

  • Single shot - where the agent is retained until at least one build has completed.
  • Single shot (multi-executor) - where the agent is retained until at least one build has completed and at most one build per executor has completed.
  • Multi shot - where the agent is retained until at least one build has completed and at most a configurable fixed number of builds have completed.
  • Monopolising - where the agent is retained until there are no jobs in the queue for the agent to execute builds on.
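
The four retention rules above can be read as an upper bound on the number of builds per lease. The following is a minimal Python sketch of that reading, not JOC code: the function name and the strategy labels are hypothetical, and treating Single shot as a limit of exactly one build assumes the recommended one-executor configuration.

```python
from typing import Optional


def max_builds_per_lease(strategy: str, executors: int, shot_count: int = 1) -> Optional[int]:
    """Return the most builds an agent may complete before it is released.

    Returns None for the monopolising strategy, which has no fixed bound:
    the agent is kept until the master's queue has nothing left for it.
    All names are illustrative and not part of any JOC API.
    """
    if executors < 1:
        raise ValueError("an agent needs at least one executor")
    if strategy == "single-shot":
        return 1                  # assumes a one-executor agent
    if strategy == "single-shot-multi-executor":
        return executors          # at most one build per executor
    if strategy == "multi-shot":
        return shot_count         # configurable fixed number of builds
    if strategy == "monopolising":
        return None               # bounded only by the master's queue
    raise ValueError(f"unknown strategy: {strategy}")
```

For example, a four-executor agent under Single shot (multi-executor) would run at most four builds before going back to JOC, while the same agent under Multi shot with a count of 20 would run at most 20.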

JOC 1.0 was released with the Single shot retention strategy and the recommendation to configure shared agents for one executor only. Single shot gives very good and consistent behaviour. The key metric (average time between entering the queue and starting execution) behaves as you would expect based on experience with a standalone master. That is, it is purely a function of the size of the work pool, the rate of jobs arriving and the length of time jobs spend executing (with some modification for the provisioning delay required as the length of time executing nears 0).
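
As a rough illustration of how those three inputs interact, the sketch below computes executor utilisation using the standard offered-load formula from queueing theory. It is a back-of-the-envelope aid, not something JOC calculates; the function name and the sample figures are hypothetical, and lease and provisioning overhead are ignored.

```python
def utilisation(jobs_per_minute: float, build_minutes: float, executors: int) -> float:
    """Fraction of executor capacity consumed by the incoming work.

    Offered load = arrival rate x average build time; dividing by the
    number of executors gives utilisation. At 1.0 or above, the queue
    grows without bound because work arrives faster than it can finish.
    """
    if executors < 1:
        raise ValueError("need at least one executor")
    return jobs_per_minute * build_minutes / executors


# One job every two minutes, six-minute builds, four executors in the pool.
print(utilisation(0.5, 6.0, 4))  # 0.75
```

The closer this figure gets to 1.0, the longer jobs wait before starting, even before any contention between masters is taken into account.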

Configuring many single-executor agents can be tricky (there are some techniques to make it easier), so in response to requests to support more than one executor per agent, the Single shot (multi-executor) strategy was introduced. This, unsurprisingly, gave the highest throughput of builds when looked at as a cluster average, so the JOC 1.1 release switched to this strategy as the default.

When you have a single CM, the overhead of lease and release means that the Monopolising strategy is the best. As soon as you have more than one CM (and the whole point of JOC is that you have more than one CM), if there is any contention at all for the agents that the CMs need, the Monopolising strategy leads to queue explosions. This is especially the case where the agents have more than one executor. As mentioned, once a JOC agent is serving a CM, the other CMs/MMs never get a chance to use that agent (hence the name Monopolising). What you then see is that the number of jobs in the queue on the other masters explodes and the key metric (average time between entering the queue and starting execution) across the cluster skyrockets.

Switching to Multi shot counteracts this issue. Multi shot has been tested with the count at 100, 50 and 20 builds. The finding was that the probability of the key metric (average time between entering the queue and starting execution) across the cluster skyrocketing increased as the count increased. In other words, 20 was less likely than 50 to go rogue, and 50 was less likely than 100.
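
The effect of the count can be pictured with a deliberately simplified model, sketched below in Python. It assumes one shared agent with one executor, a first master whose queue never empties, builds that each take one tick, and a JOC that hands the released agent straight to the second master. None of this is measured data, and the function name is hypothetical.

```python
from typing import Optional


def ticks_until_second_master_served(shot_count: Optional[int],
                                     horizon: int = 1000) -> Optional[int]:
    """Ticks the second master waits before it can lease the agent.

    shot_count=None models the Monopolising strategy. Returns None when
    the second master is still waiting at the end of the horizon.
    """
    builds_in_lease = 0
    for tick in range(horizon):
        # The first master always has work, so it runs one build per tick.
        builds_in_lease += 1
        if shot_count is not None and builds_in_lease >= shot_count:
            return tick + 1  # lease ends here; the second master is next
    return None


for count in (20, 50, 100, None):
    print(count, ticks_until_second_master_served(count))
```

In this toy model the second master's wait grows linearly with the count and never ends under Monopolising, which is consistent with the direction of the test results described above.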

Non-monopolising strategies keep a count of the number of builds that have started on the agent. As soon as that count goes above the strategy's limit, the strategy marks the agent as suspended (which is how you tell Jenkins not to assign any more work to that agent).
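
The counting behaviour can be sketched as follows. This is an illustrative Python model, not the plugin's implementation: the class and method names are hypothetical, and suspending once the count reaches the limit (so that a limit of 1 allows exactly one build) is an assumption about where the boundary sits.

```python
class LeasedAgent:
    """Toy model of a shared agent leased to one master."""

    def __init__(self, limit: int) -> None:
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self.limit = limit          # most builds allowed during this lease
        self.builds_started = 0     # builds started since the lease began
        self.suspended = False      # True means: accept no more work

    def on_build_started(self) -> None:
        if self.suspended:
            raise RuntimeError("a suspended agent must not receive new builds")
        self.builds_started += 1
        # Assumption: suspend as soon as the allowance has been used up.
        if self.builds_started >= self.limit:
            self.suspended = True


agent = LeasedAgent(limit=2)
agent.on_build_started()
agent.on_build_started()
print(agent.suspended)  # True
```

This is why an agent can appear as suspended while the master still has jobs queued: the agent has used up its allowance for the current lease and is waiting to return to JOC.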

How to switch to a different strategy?

By default, JOC comes with retentionStrategyShotCount=1. The resulting strategy then depends on the number of executors:

  • Single shot - # of executors = 1.
  • Single shot (multi-executor) - # of executors > 1.

To move away from Single shot, the retentionStrategyShotCount property needs to be updated:

  • Multi shot - -Dcom.cloudbees.opscenter.client.cloud.CloudImpl.retentionStrategyShotCount=N, where N is the maximum number of builds and is greater than 1 (for example 20)
  • Monopolising - -Dcom.cloudbees.opscenter.client.cloud.CloudImpl.retentionStrategyShotCount=-1
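
The mapping from property value to strategy can be summarised in a small Python helper. It is a sketch for checking a planned setting, not a CloudBees tool: the function names are hypothetical, and treating values greater than 1 as Multi shot is an inference from the default of 1 selecting Single shot. Values this article does not cover are rejected rather than guessed at.

```python
PROPERTY = "com.cloudbees.opscenter.client.cloud.CloudImpl.retentionStrategyShotCount"


def strategy_for(shot_count: int, executors: int) -> str:
    """Name the retention strategy implied by a shot count and executor count."""
    if executors < 1:
        raise ValueError("an agent needs at least one executor")
    if shot_count == -1:
        return "Monopolising"
    if shot_count == 1:
        return "Single shot" if executors == 1 else "Single shot (multi-executor)"
    if shot_count > 1:
        return "Multi shot"  # assumption: any count above 1
    raise ValueError(f"shot count {shot_count} is not covered by this article")


def jvm_argument(shot_count: int) -> str:
    """Format the system property as a JVM startup argument."""
    return f"-D{PROPERTY}={shot_count}"


print(strategy_for(20, 4), jvm_argument(20))
```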