Managed Masters fail to provision or take too long to provision. Eventually, the worker is not able to provision anything and has to be replaced by a new one.
After Managed Master provisioning issues occur, the syslog shows entries similar to the ones below:
com.cloudbees.dac.castle.VolumeDeviceUtils$BadDevices ban
FINE: Banned device /dev/sdj
com.cloudbees.dac.castle.EbsBackend lambda$tagAndMountVolume$4
SEVERE: Error attaching volume
WARNING: Timed out waiting for volume vol-XXXXXXXXXX. Detaching and banning device /dev/sdj in instance i-XXXXXXXXXX
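When the syslog is large, a short script can list the devices that appear in ban messages. The following is a minimal sketch in Python, assuming the messages use the wording shown above. The function name `find_banned_devices` and the sample input are illustrative and are not part of any CloudBees tooling.

```python
import re

# Matches both "Banned device /dev/sdj" and "banning device /dev/sdj".
# The pattern is an assumption based on the sample messages above.
BAN_PATTERN = re.compile(r"[Bb]ann(?:ed|ing) device (/dev/\w+)")


def find_banned_devices(log_text):
    """Return the sorted, de-duplicated device paths found in ban messages."""
    return sorted(set(BAN_PATTERN.findall(log_text)))


if __name__ == "__main__":
    sample = (
        "FINE: Banned device /dev/sdj\n"
        "WARNING: Timed out waiting for volume vol-XXXXXXXXXX. "
        "Detaching and banning device /dev/sdj in instance i-XXXXXXXXXX\n"
    )
    print(find_banned_devices(sample))
```

Running it against a real log only requires reading the file contents into a string and passing that string to the function.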
The Castle service bans the device and adds it to a list stored on the worker at the following path:
To rule out that Castle is wrongly marking the volume as banned, verify that the timeout is not caused by the frequency of AWS API calls configured in Castle.
To verify this, review the CloudWatch events for the period corresponding to the log messages and look for a run of consecutive, closely spaced events like the ones shown below:
DetachVolume 2019-01-03T09:XX:39.000Z - Event ID xxx-yyy-dddd
DetachVolume 2019-01-03T09:XX:38.000Z - Event ID xxx-aaa-dddd
DetachVolume 2019-01-03T09:XX:37.000Z - Event ID xxx-ccc-ffff
Once you have confirmed that you have CloudWatch entries similar to the ones above, the most likely cause of the issue is that the Castle retry properties are not set correctly.
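To check an exported list of event times for such a burst, you can compare consecutive timestamps. The sketch below is illustrative only. It assumes the event times are available as ISO 8601 strings in the format shown above. The function names and the 2-second gap are hypothetical choices, not CloudBees or AWS recommendations, and the minute value `15` in the sample stands in for the redacted `XX`.

```python
from datetime import datetime, timedelta


def parse_timestamp(value):
    """Parse a timestamp such as 2019-01-03T09:15:39.000Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")


def longest_run(timestamps, max_gap_seconds=2):
    """Return the length of the longest run of events spaced at most
    max_gap_seconds apart (0 for an empty list)."""
    times = sorted(parse_timestamp(t) for t in timestamps)
    if not times:
        return 0
    max_gap = timedelta(seconds=max_gap_seconds)
    longest = current_run = 1
    for previous, current in zip(times, times[1:]):
        # Extend the run when the events are close together, otherwise restart.
        current_run = current_run + 1 if current - previous <= max_gap else 1
        longest = max(longest, current_run)
    return longest


if __name__ == "__main__":
    detach_events = [
        "2019-01-03T09:15:39.000Z",
        "2019-01-03T09:15:38.000Z",
        "2019-01-03T09:15:37.000Z",
    ]
    print(longest_run(detach_events))
```

A long run of DetachVolume events only a second apart matches the pattern described above. A result of 1 means no two events were close together.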
- CloudBees Jenkins Enterprise
- CloudBees Jenkins Enterprise - Managed Master
- CloudBees Jenkins Enterprise - Operations Center
This behavior can be considered normal: Castle tries to attach a given volume to the worker and, once the operation times out, marks the volume/device as banned so that it can safely ignore it while provisioning new masters/applications.
As mentioned in the Issue statement section, the most likely cause of the issue is that the Castle retry properties are not set correctly. These properties can be found in the following file:
.dna/servers/castle/dna.config or in
.dna/project.config under the
Follow the steps below to correct the values:
- In your CloudBees Jenkins Enterprise project, edit the DNA properties file
- Find the [castle] section and the
- Add each property that you want to set to the jvm_options property as a Java system property, i.e. using the -D option. In this particular case, two properties need to be changed:
com.cloudbees.dac.castle.util.AWSUtils.retryMaximumTimeSeconds with a recommended value of 30 seconds.
com.cloudbees.dac.castle.util.AWSUtils.retryAttempts with a recommended value of 20.
- Force an update of the project by running the following command from your Bastion host console:
cje upgrade --config-only --force
- Finally, reinitialize Castle by running:
dna reinit castle
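For the two properties named in the steps above, the -D options can be composed as follows. This is a minimal sketch in Python. The helper `to_jvm_options` is hypothetical, and it assumes the timeout is given as a plain number of seconds. How the resulting string is written into the jvm_options property depends on your DNA configuration file and is not shown here.

```python
# Recommended values taken from the steps above.
CASTLE_RETRY_PROPERTIES = {
    "com.cloudbees.dac.castle.util.AWSUtils.retryMaximumTimeSeconds": 30,
    "com.cloudbees.dac.castle.util.AWSUtils.retryAttempts": 20,
}


def to_jvm_options(properties):
    """Render a mapping of Java system properties as space-separated -D options."""
    return " ".join(f"-D{name}={value}" for name, value in properties.items())


if __name__ == "__main__":
    print(to_jvm_options(CASTLE_RETRY_PROPERTIES))
```

Generating the string this way avoids typos in the long property names when the same values must be applied to several projects.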
If you prefer not to follow the steps above, a temporary workaround is available that can help you overcome this kind of provisioning issue for a given worker: