KBEC-00505 - Understanding the CD Step Scheduler under load

Problem

How can the CD Scheduler be configured to limit the number of steps that can be scheduled on each pass?

Solution

When a step in CD becomes “runnable”, the system must confirm whether a resource can be assigned to that step.
This confirmation is handled by the StepScheduler, as outlined in this article:

See: KBEC-00472 - Understanding CloudBees CD Step Status Values

Every time the scheduler starts a pass, it uses the system setting resourceSchedulerBatchSize to define the maximum number of steps that may be processed on that pass.

The default value for this setting is 500.

Since the CD system is a dynamic environment, let’s use a simplified example to try to understand what this means.
Let’s assume that 1200 steps are currently waiting to be processed.

With 1200 steps sitting in a “runnable” state, and the default listed above, at most 500 of these steps could become scheduled on this pass of the Step Scheduler.
This means at least 700 steps will be deferred on this pass. Some of the steps that are reviewed (that is, processed) may not be able to be assigned a resource. Such steps return to the queue for a subsequent pass. For the sake of this example, let’s say that 400 of the reviewed steps were scheduled, and the other 100 had no available resource. Again, in this simplified model, let’s assume that no new steps were added to the queue.

The result from Pass #1 of the scheduler would be that our 1200-step queue would have dropped down to a size of only 800.  

But which steps will be reviewed on the subsequent pass?

The 100 steps that were reviewed in Pass #1 but could not be assigned a resource now sit at the top of the queue. They will be reviewed again in Pass #2, along with the next 400 steps in the queue (if time permits).

Let’s assume that 200 steps got scheduled this time around, so the queue size is now reduced to 600 steps.

Pass #3 would try to look at the first 500 steps remaining.  

Notice that because the scheduler relies on this FIFO model, the first 3 passes never reach the 100 steps most recently added to the queue.
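
The three passes described above can be reproduced with a small simulation. This is only a sketch of the simplified model used in this article, not the StepScheduler implementation: the function names are hypothetical, deferred steps are assumed to be the ones at the head of each batch, and the number scheduled in Pass #3 (not given in the example) is set to 0 as a placeholder.

```python
from collections import deque

BATCH_SIZE = 500  # default resourceSchedulerBatchSize cited in this article


def run_pass(queue, batch_size, scheduled_count):
    """Review up to batch_size steps from the head of the queue (simplified FIFO model).

    scheduled_count is an input to the model: how many reviewed steps found a resource.
    Deferred steps return to the head of the queue in their original order.
    Returns the list of step ids reviewed on this pass.
    """
    reviewed = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
    scheduled_count = min(scheduled_count, len(reviewed))
    # Assumption: the steps that could not be scheduled are the first ones reviewed.
    deferred = reviewed[: len(reviewed) - scheduled_count]
    queue.extendleft(reversed(deferred))
    return reviewed


def simulate(total_steps, scheduled_per_pass, batch_size=BATCH_SIZE):
    """Run one pass per entry in scheduled_per_pass; no new steps arrive."""
    queue = deque(range(1, total_steps + 1))
    ever_reviewed = set()
    sizes = []
    for scheduled in scheduled_per_pass:
        ever_reviewed.update(run_pass(queue, batch_size, scheduled))
        sizes.append(len(queue))
    never_reviewed = [step for step in queue if step not in ever_reviewed]
    return sizes, never_reviewed


# 1200 runnable steps; 400 scheduled in Pass #1, 200 in Pass #2, placeholder 0 in Pass #3.
sizes, never_reviewed = simulate(1200, [400, 200, 0])
print(sizes)                # queue size after each pass
print(len(never_reviewed))  # steps not reviewed in any of the 3 passes
```

Under these assumptions the queue sizes after the first two passes are 800 and 600, and the 100 steps at the tail of the queue are never reviewed.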

This pattern can create situations where users might wonder why it is taking extra time for their step to be scheduled when the resource to be assigned is seemingly available.

In most environments, each pass of the scheduler typically processes only tens of steps, not hundreds, and completes in less than 1 second. However, in peak load cases where more than 500 steps are in the queue, it can take the scheduler many seconds to complete its review. After several such passes, the wait before a step sitting lower in the queue starts may grow to a noticeable number of seconds.

The first recommendation for addressing such situations is usually to consider adding more resources to the pools that are experiencing bottlenecks (consider the example where 100 steps were deferred). Also recognize that when the steps near the top of the queue can be processed and assigned a resource right away, without having to wait for an agent to become available, they essentially “get out of the way” and give newer steps a chance to be processed by the scheduler. The general conclusion is that having more resources can help bring your average wait times down to an acceptable level.

Another approach might be to increase the value of resourceSchedulerBatchSize. However, doing so means that each scheduler pass will take longer to complete during peak times, which also means delays for steps lower down the queue. Changes here should therefore be made carefully, and will need trial-and-error iterations in your environment to understand the implications during peak hours vs. times when the queue is not so large.

If peak load regularly occurs at specific hours of the day, then based on your experiments, another approach might be to set up a schedule for a simple procedure that changes resourceSchedulerBatchSize before the typical peak period. For example, you could create a utility procedure that increases this value at 10:00am every day, and subsequently decreases it back to the default at 4:00pm.
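
The decision such a utility procedure would make can be sketched as follows. This is only an illustration: the function name and the peak value of 750 are hypothetical, and the sketch deliberately stops short of applying the value, since how the setting is changed depends on your environment.

```python
from datetime import time

DEFAULT_BATCH_SIZE = 500   # default resourceSchedulerBatchSize cited in this article
PEAK_BATCH_SIZE = 750      # hypothetical value; determine yours by experiment
PEAK_START = time(10, 0)   # 10:00am, from the example above
PEAK_END = time(16, 0)     # 4:00pm, from the example above


def batch_size_for(now):
    """Return the batch size that should be in effect at the given time of day."""
    if PEAK_START <= now < PEAK_END:
        return PEAK_BATCH_SIZE
    return DEFAULT_BATCH_SIZE


print(batch_size_for(time(9, 59)))   # before the peak window
print(batch_size_for(time(10, 0)))   # peak window starts
print(batch_size_for(time(16, 0)))   # back to the default
```

Keeping the window boundaries and values in one place makes it easier to adjust them as your trial-and-error results come in.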

Q: How does step priority get factored into this process?

A: Most steps are assigned “normal” priority. A High or Low priority simply adjusts where a step is placed in the queue. Steps arriving with High priority move towards the top of the queue for more immediate processing, while a step set with Low priority is placed near the bottom of the queue, resulting in potentially longer wait times during heavy work periods.
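
The ordering effect described in this answer can be illustrated with a small priority queue. This is a sketch of the idea only, not how the scheduler is implemented: the class, the step names, and the rule that ties are broken by arrival order are all assumptions made for the example.

```python
import heapq
from itertools import count

# Lower rank is reviewed first (assumed ordering for this illustration).
PRIORITY_RANK = {"high": 0, "normal": 1, "low": 2}


class StepQueue:
    """Hypothetical queue ordered by priority, then by arrival order."""

    def __init__(self):
        self._heap = []
        self._arrival = count()

    def add(self, name, priority="normal"):
        # The arrival counter keeps FIFO order among steps of equal priority.
        heapq.heappush(self._heap, (PRIORITY_RANK[priority], next(self._arrival), name))

    def next_batch(self, batch_size):
        """Remove and return up to batch_size step names from the top of the queue."""
        n = min(batch_size, len(self._heap))
        return [heapq.heappop(self._heap)[2] for _ in range(n)]


q = StepQueue()
q.add("build", "normal")
q.add("cleanup", "low")
q.add("test", "normal")
q.add("hotfix", "high")

batch = q.next_batch(3)
print(batch)  # ['hotfix', 'build', 'test']
```

With a batch size of 3, the low-priority step is the one left waiting for a later pass, even though it arrived second.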

Q: Could the scheduler create deadlocks?

A: Generally not, for most use cases. However, one pattern that should be avoided is marking steps as Low priority in procedures that use Exclusive resources. An Exclusive resource is essentially locked for use by the procedure until it is freed, so a Low priority step inside this exclusive period could leave the resource tied to a job whose step may not get cycles for many scheduler passes (until the burst in activity subsides). Given the resourceSchedulerBatchSize default of 500 steps, such situations are only likely to arise when workloads cause the queue to grow larger than this size. Note that if you were to lower this default, priority/exclusive blockages could become a problem more often. The simpler rule to remember is: avoid using Exclusive resources with Low priority work.
