AWS workers show banned volumes

Issue

Managed Masters fail to provision or take too long to provision. Eventually, the worker is not able to provision anything and has to be replaced by a new one.

After experiencing some Managed Master provisioning issues, we can see entries in the syslogs similar to the ones shown below:

com.cloudbees.dac.castle.VolumeDeviceUtils$BadDevices ban
FINE: Banned device /dev/sdj
com.cloudbees.dac.castle.EbsBackend lambda$tagAndMountVolume$4
SEVERE: Error attaching volume

And

WARNING: Timed out waiting for volume vol-XXXXXXXXXX. Detaching and banning device /dev/sdj in instance i-XXXXXXXXXX

The device is banned by the Castle service and added to a list that can be found inside the affected worker at the path /tmp/castle/baddevices/dev/.
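
As a quick check, and assuming you can open a shell on the affected worker, you could list the contents of that directory; the entries are assumed here to be named after the banned devices (e.g. sdj):

 # List the devices currently banned by Castle on this worker (assumed layout)
 ls /tmp/castle/baddevices/dev/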

In order to rule out Castle wrongly marking the volume as banned, you need to verify that the timeout is not caused by the AWS API call frequency configured in Castle.

To verify this, review the CloudWatch events for the period corresponding to the log messages and check whether you see a list of consecutive, very closely spaced events like the ones shown below:

 DetachVolume              2019-01-03T09:XX:39.000Z - Event ID xxx-yyy-dddd
 DetachVolume              2019-01-03T09:XX:38.000Z - Event ID xxx-aaa-dddd
 DetachVolume              2019-01-03T09:XX:37.000Z - Event ID xxx-ccc-ffff
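
As an alternative to the CloudWatch console, and assuming these API calls are also recorded in CloudTrail and the AWS CLI is configured with permission to read it, a sketch of how the same events could be listed from the command line:

 # Look up DetachVolume API calls around the time of the banned-device messages
 aws cloudtrail lookup-events \
     --lookup-attributes AttributeKey=EventName,AttributeValue=DetachVolume \
     --start-time 2019-01-03T09:00:00Z --end-time 2019-01-03T10:00:00Z \
     --max-results 50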

Once you have confirmed that you have CloudWatch entries similar to the ones above, the most likely cause of the issue is that the Castle retry properties are not correctly set.

Environment

Resolution

This behavior can be considered normal: Castle tries to attach a given volume to the worker, and once the operation times out the volume/device is marked as banned so that Castle can safely ignore it while provisioning new masters/applications.

As mentioned in the Issue section, the most likely cause of the issue is that the Castle retry properties are not correctly set. These properties can be found in .dna/servers/castle/dna.config, or in .dna/project.config under the [castle] section.

Please find below the steps needed to correct the values:

  • In your CloudBees Jenkins Enterprise project, edit the DNA properties file .dna/project.config
  • Find the [castle] section and the jvm_options property.
  • Add each property that you would like to set to the jvm_options property as a Java system property, i.e. using the -D option. In this particular case, two properties need to be adjusted:
  • com.cloudbees.dac.castle.util.AWSUtils.retryMaximumTimeSeconds, with a recommended value of 30 seconds.
  • com.cloudbees.dac.castle.util.AWSUtils.retryAttempts, with a recommended value of 20 attempts.

e.g.:

 jvm_options=-Dcom.cloudbees.dac.castle.util.AWSUtils.retryMaximumTimeSeconds=30 -Dcom.cloudbees.dac.castle.util.AWSUtils.retryAttempts=20
  • Force an update of the project by running the following command from your Bastion host console: cje upgrade --config-only --force.
  • Finally, reinit Castle by running dna reinit castle. Both commands are shown together after this list.
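
For reference, the last two steps run from the Bastion host console would look like this:

 # Push the updated project configuration, then re-initialize Castle
 cje upgrade --config-only --force
 dna reinit castle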

Workaround

If you don’t want to follow the steps above, there is a temporary workaround available that can help you overcome this kind of provisioning issue for a given worker:
