Timeout when attaching volumes in Kubernetes

Issue

  • A large number of files in a volume causes a timeout when trying to attach that volume
  • There’s a timeout when waiting for volumes to attach when using Kubernetes
  • There’s a timeout when trying to attach volumes to a Jenkins instance or Master
  • I am seeing a warning or an event similar to the one below when trying to attach a volume:
  Unable to mount volumes for pod "<POD_IDENTIFIER>": timeout expired waiting for volumes to attach or mount for pod "jenkins"/"<POD_NAME>". list of unmounted volumes=[output]. list of unattached volumes=[output <VOLUMES>]

Environment

Related Issue(s)

Explanation

Since CloudBees CI pods run as non-root users, the Operations Center pod and Controller (formerly Master) pods have the fsGroup set to the Jenkins group 1000 by default, so that mounted volumes are writable by the pod user.

With this setting, Kubernetes recursively checks and changes the ownership and permissions of every file and directory on each volume mounted to the pod. When a volume is or becomes very large, this can take a long time and slow down pod startup. In Kubernetes 1.20, there is a beta feature, the File System Change Policy, that can help reduce the time it takes to set the permissions. See Configure volume permission and ownership change policy for Pods and Kubernetes 1.20: Granular Control of Volume Permission Changes.
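As a sketch, the policy above is expressed through the fsGroupChangePolicy field of the pod securityContext (beta in Kubernetes 1.20). With OnRootMismatch, Kubernetes skips the recursive ownership change when the volume root already matches the fsGroup. The pod, image, and claim names below are placeholders, not values from this article:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: jenkins-example          # placeholder name
spec:
  securityContext:
    fsGroup: 1000
    # Only chown/chmod the volume recursively when the ownership of
    # its root directory does not already match fsGroup
    fsGroupChangePolicy: "OnRootMismatch"
  containers:
    - name: jenkins
      image: jenkins/jenkins:lts   # placeholder image
      volumeMounts:
        - name: jenkins-home
          mountPath: /var/jenkins_home
  volumes:
    - name: jenkins-home
      persistentVolumeClaim:
        claimName: jenkins-home-pvc   # placeholder claim
```

The default policy, Always, reproduces the behavior described above: the full recursive change on every mount.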

When impacted, this shows up as multiple occurrences of timeout expired waiting for volumes to attach or mount for pod in the pod events. However, the same error can also occur when the volume backend has not yet been provisioned and attached to the host.

Check if Pod is impacted

When seeing this timeout, first check that the volume is correctly provisioned and attached to the host - for example, in AWS / EKS, check that the EBS volume is attached to the host.

  • If the volume is attached, then the fsGroup ownership change is most likely the cause of the timeout
  • If the volume is not attached, then this is a different problem related to the provisioning of the external storage, and the use of fsGroup is most likely not involved.

Workaround

The workaround for the problem is to remove the fsGroup: go to the Controller item configuration, leave the FS Group field empty, and Save. Then restart the Controller from Operations Center.

Note: The fsGroup must be kept at 1000 when the volume is created - such as when creating a Controller. It is safe to remove the fsGroup from existing volumes. The recommended strategy is to remove the fsGroup only when impacted by this particular problem.
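For reference, the effect of the workaround on the pod spec can be sketched as below. This is an illustration of the before/after securityContext, not the exact manifest CloudBees CI generates:

```yaml
# On volume creation (e.g. when creating a Controller), keep the fsGroup
# so the new volume is writable by the non-root pod user:
securityContext:
  fsGroup: 1000

# After the workaround (FS Group field left empty), the pod no longer
# declares an fsGroup, so Kubernetes skips the recursive ownership
# change and keeps the existing file ownership on the volume:
securityContext: {}
```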

Tested environment

  • AKS - Azure Kubernetes Service
  • EKS - Amazon Elastic Kubernetes Service
