- Agent provisioning fails and the Jenkins logs show something similar to the following:
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.162.0.1/api/v1/namespaces/cloudbees-core/pods. Message: Operation cannot be fulfilled on resourcequotas "<resourceQuotaName>": the object has been modified; please apply your changes to the latest version and try again.
- CloudBees Core on Modern Cloud Platforms
- CloudBees Core
- CloudBees Core on Modern Cloud Platforms - Managed Master
- CloudBees Core on Modern Cloud Platforms - Operations Center
- CloudBees Core on Traditional Platforms - Client Master
- CloudBees Core on Traditional Platforms - Operations Center
- CloudBees Jenkins Platform - Client Master
- CloudBees Jenkins Platform - Operations Center
- CloudBees Jenkins Distribution
- Jenkins LTS
- Kubernetes Plugin
- Kubernetes - Resource Quotas
This error is more likely to occur when many pods are scheduled simultaneously. In Jenkins, this may happen when many agents need to be provisioned at once, for example when building jobs in bulk or when using parallel steps with many tasks.
Note: In GKE, ResourceQuotas are automatically applied to every namespace under certain conditions and cannot be deleted. See https://cloud.google.com/kubernetes-engine/quotas for more details.
There are several workarounds that could help to prevent agent failures caused by this issue:
By default, the Jenkins NodeProvisioner bases its decisions on load statistics, which gives acceptable average queue waiting times while preventing over-provisioning.
There are, however, provisioning strategies in Jenkins that aim to speed up agent provisioning. In the context of Kubernetes:
- NoDelayProvisionerStrategy, implemented by the Kubernetes plugin
- If using CloudBees Core on Modern Cloud Platforms, the KubernetesNodeProvisionerStrategy, implemented by the Kube Agent Management plugin
Those strategies launch agents as soon as they are needed. When build tasks are launched in bulk, several agent pods may be scheduled almost simultaneously and trigger this Resource Quotas issue, whereas the default strategy provisions agents more gradually.
Therefore, a workaround is to disable those provisioning strategies:
- Disable the NoDelayProvisionerStrategy by adding the system property -Dio.jenkins.plugins.kubernetes.disableNoDelayProvisioning=true on master startup
- If using CloudBees Core on Modern Cloud Platforms, disable the KubernetesNodeProvisionerStrategy by adding the system property -Dcom.cloudbees.jenkins.plugins.kube.KubernetesNodeProvisionerStrategy.enabled=false on master startup
This requires a restart of the master. See "How to add Java arguments to Jenkins?".
Note: Disabling those strategies may result in slower provisioning overall, although the difference may be negligible depending on the workload of the master.
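After the restart, it can be useful to confirm that the JVM actually received the flags. The following is a minimal sketch that reads back the two system properties named above using the standard System.getProperty call. The class name ProvisioningFlags is hypothetical, and running the class on its own only reports the properties of that JVM; to inspect a running master, the same System.getProperty calls can be evaluated in the Jenkins Script Console.

```java
// Sketch only: reads back the two system properties from the workaround above.
// The class name is hypothetical and not part of any Jenkins plugin.
final class ProvisioningFlags {
    // Property names copied from the workaround above.
    static final String NO_DELAY_DISABLED =
        "io.jenkins.plugins.kubernetes.disableNoDelayProvisioning";
    static final String KUBE_STRATEGY_ENABLED =
        "com.cloudbees.jenkins.plugins.kube.KubernetesNodeProvisionerStrategy.enabled";

    /** Returns "name=value", or "name=<not set>" when the JVM did not receive the flag. */
    static String describe(String name) {
        String value = System.getProperty(name);
        return name + "=" + (value == null ? "<not set>" : value);
    }

    public static void main(String[] args) {
        System.out.println(describe(NO_DELAY_DISABLED));
        System.out.println(describe(KUBE_STRATEGY_ENABLED));
    }
}
```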
Avoid scheduling too many builds in bulk. Instead of launching hundreds of tasks at once, throttle the scheduling, for example by launching tasks in smaller chunks.
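As an illustration of launching tasks in smaller chunks, the sketch below splits a list of tasks into consecutive batches. It is a generic example, not a Jenkins API: the class ChunkedScheduling, the task names, and the chunk size of 4 are all assumptions made for the example. In a real setup each batch would be launched and allowed to finish, or at least to obtain its agents, before the next batch is started.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: generic batching logic, not part of Jenkins or any plugin.
final class ChunkedScheduling {
    /** Splits tasks into consecutive batches of at most chunkSize elements. */
    static <T> List<List<T>> chunks(List<T> tasks, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> result = new ArrayList<>();
        for (int start = 0; start < tasks.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, tasks.size());
            // Copy the view so each batch is independent of the input list.
            result.add(new ArrayList<>(tasks.subList(start, end)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tasks = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            tasks.add("task-" + i); // hypothetical task names
        }
        // Hypothetical chunk size; tune it to what the namespace tolerates.
        for (List<String> batch : chunks(tasks, 4)) {
            System.out.println("launch together: " + batch);
        }
    }
}
```

A suitable chunk size depends on the cluster and on how quickly pods are admitted, so it is best found by experiment.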
If acceptable and possible:
- Remove the Resource Quotas in the namespace where agents are spun up.
Note: This might not be possible in GKE. See https://cloud.google.com/kubernetes-engine/quotas
In CloudBees Core on Modern Cloud Platforms, the Kube Agent Management plugin sets up an exponential backoff period for agent provisioning failures. Jenkins waits a certain amount of time before retrying to provision a specific agent template.
When this Kubernetes issue is hit because of the Resource Quotas, the re-provisioning of the failed agent may be considerably delayed by this backoff period. In environments where this is a problem, an additional workaround is to either disable the backoff period or reduce the maximum backoff time that Jenkins waits before retrying to provision the same agent.
- The maximum backoff period may be decreased by adding the system property
- Since version 1.1.32 of the Kube Agent Management plugin, the backoff period may be disabled by adding the system property
Note: This workaround does not prevent agent failures, but it avoids delaying the re-provisioning of failed agents.
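To show why lowering the maximum backoff shortens the wait before a retry, here is a minimal sketch of a capped exponential backoff. It is not the Kube Agent Management plugin's implementation: the doubling factor, the base delay of 1 second, and the caps of 60 and 8 seconds are assumptions chosen for illustration, as is the class name BackoffSketch.

```java
// Sketch only: a generic capped exponential backoff, not the plugin's actual code.
final class BackoffSketch {
    /**
     * Delay before retry number attempt (1-based):
     * baseSeconds doubled (attempt - 1) times, capped at maxSeconds.
     */
    static long delaySeconds(int attempt, long baseSeconds, long maxSeconds) {
        if (attempt < 1 || baseSeconds <= 0 || maxSeconds <= 0) {
            throw new IllegalArgumentException("attempt, base and max must be positive");
        }
        long delay = baseSeconds;
        // Stop doubling once the cap is reached, which also avoids overflow.
        for (int i = 1; i < attempt && delay < maxSeconds; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxSeconds);
    }

    public static void main(String[] args) {
        // Hypothetical values: 1 second base, compared under two different caps.
        for (int attempt = 1; attempt <= 8; attempt++) {
            System.out.println("attempt " + attempt
                + ": cap 60s -> " + delaySeconds(attempt, 1, 60) + "s"
                + ", cap 8s -> " + delaySeconds(attempt, 1, 8) + "s");
        }
    }
}
```

With a lower cap, repeated failures stop lengthening the wait much sooner, which is the effect the maximum-backoff workaround relies on.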