Controller will not start after modifying namespace

Issue

If you stop a running controller (team or managed), change its Namespace field to an invalid value, click Save, and then click Acknowledge error, the controller fails to start. This is expected, because the namespace does not exist. The startup logs look similar to:

[Tue May 25 18:56:48 UTC 2021] Stopping master: cloudbees-ci/test
[Tue May 25 18:56:48 UTC 2021] Deleting service cloudbees-ci/test
[Tue May 25 18:56:48 UTC 2021] Deleting ingress cloudbees-ci/test
[Tue May 25 18:56:48 UTC 2021] Deleting stateful set cloudbees-ci/test
[Tue May 25 18:56:48 UTC 2021][Normal][Ingress][test][DELETE] Ingress cloudbees-ci/test
ERROR: Could not request to expand disk
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://10.96.0.1/api/v1/namespaces/a/persistentvolumeclaims/jenkins-home-test-0. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. persistentvolumeclaims "jenkins-home-test-0" is forbidden: User "system:serviceaccount:cloudbees-ci:cjoc" cannot get resource "persistentvolumeclaims" in API group "" in the namespace "a".
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:570)

The problem is that, once this happens, you can no longer edit the Namespace field to change it back to a valid value.
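
To confirm that the failure comes from the Namespace value, you can check from the command line whether the namespace exists and whether the operations center service account can read PersistentVolumeClaims in it. The sketch below is not part of the official procedure. It takes its values from the example log above (`cloudbees-ci`, `cjoc`, and the invalid namespace `a`), so substitute your own. The function names are illustrative, and `--as` only works if your own user is allowed to impersonate service accounts.

```shell
# Values taken from the example log above; substitute your own.
OC_NAMESPACE="cloudbees-ci"   # namespace where operations center runs
OC_SERVICE_ACCOUNT="cjoc"     # service account named in the error message
BAD_NAMESPACE="a"             # invalid value saved in the Namespace field

# Fails with "NotFound" if the namespace does not exist.
check_namespace() {
  kubectl get namespace "$1"
}

# Prints "yes" or "no" depending on whether the operations center
# service account may read PVCs in the given namespace.
check_pvc_access() {
  kubectl auth can-i get persistentvolumeclaims \
    --namespace "$1" \
    --as "system:serviceaccount:${OC_NAMESPACE}:${OC_SERVICE_ACCOUNT}"
}

# Example usage:
#   check_namespace "$BAD_NAMESPACE"
#   check_pvc_access "$BAD_NAMESPACE"
```

If the namespace is missing or the answer is "no", the error in the log is consistent with the invalid Namespace value rather than with a revoked service account.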

Environment

Resolution

This is a bug that is planned to be fixed in an upcoming product release, under:

BEE-5019 Disable modification of namespace field when a volume exists

Workaround

To recover from this issue, you can:
1. Back up the controller data from the Kubernetes PV by following Using a rescue-pod
1. Back up the settings for the controller (startup arguments, Docker image, disk space, CPU allocation) from /var/jenkins_home/jobs/controller-name/config.xml on the operations center filesystem
1. If it’s a Teams controller, back up /var/jenkins_home/jobs/Teams/jobs/team-name/teamSecurity.xml from the operations center filesystem
1. If it’s a Managed controller, back up /var/jenkins_home/jobs/controller-name/nectar-rbac.xml from the operations center filesystem
1. Ensure that the reclaim policy of the Persistent Volume for your controller is set to Retain, not Delete. To check this, run kubectl get pv and look at the RECLAIM POLICY column for the jenkins-home-$CONTROLLER_NAME-0 claim. If the RECLAIM POLICY is Delete, change it to Retain by following Changing the reclaim policy of a PersistentVolume
1. Delete the controller
1. Create a new controller with the same settings as before (with the correct namespace field)
1. Restore the data in the Kubernetes PV. You can skip this step if you set the reclaim policy to Retain earlier: the data is still in the PV, and because you chose the same name for the controller, the same PV will be used. If the data is missing, restore it by following Using a rescue-pod.
1. If it’s a Teams controller, restore /var/jenkins_home/jobs/Teams/jobs/team-name/teamSecurity.xml on the operations center filesystem from your backup
1. If it’s a Managed controller, restore /var/jenkins_home/jobs/controller-name/nectar-rbac.xml on the operations center filesystem from your backup
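
The backup and reclaim-policy steps above could be scripted roughly as follows. This is a minimal sketch, not an official procedure. It assumes the operations center pod is named `cjoc-0` in the `cloudbees-ci` namespace and that the controller is named `test`, as in the example log, so adjust these values for your installation. The function names are illustrative. The `kubectl patch` call is the one described in the Kubernetes documentation for changing a PersistentVolume reclaim policy. The lookup by claim name may match more than one PV if the same claim name exists in several namespaces, so check the result before patching.

```shell
# Assumed values; substitute your own.
OC_NAMESPACE="cloudbees-ci"
OC_POD="cjoc-0"                                  # assumption: operations center pod name
CONTROLLER_NAME="test"
CLAIM_NAME="jenkins-home-${CONTROLLER_NAME}-0"

# Copy a file from the operations center filesystem to a local path.
backup_oc_file() {
  kubectl cp "${OC_NAMESPACE}/${OC_POD}:$1" "$2"
}

# Print the name of the PV bound to the given claim (empty if none).
pv_for_claim() {
  kubectl get pv \
    -o jsonpath="{.items[?(@.spec.claimRef.name==\"$1\")].metadata.name}"
}

# Print the reclaim policy of the given PV.
reclaim_policy() {
  kubectl get pv "$1" -o jsonpath='{.spec.persistentVolumeReclaimPolicy}'
}

# Switch the given PV to Retain so the volume survives controller deletion.
set_retain() {
  kubectl patch pv "$1" -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
}

# Look up the PV for a claim and patch it only if it is not already Retain.
ensure_retain() {
  pv="$(pv_for_claim "$1")"
  if [ -z "$pv" ]; then
    echo "No PV bound to claim $1" >&2
    return 1
  fi
  if [ "$(reclaim_policy "$pv")" != "Retain" ]; then
    set_retain "$pv"
  fi
}

# Example usage:
#   backup_oc_file "/var/jenkins_home/jobs/${CONTROLLER_NAME}/config.xml" ./config.xml
#   ensure_retain "$CLAIM_NAME"
```

Run `ensure_retain` before deleting the controller, then confirm with `kubectl get pv` that the RECLAIM POLICY column shows Retain.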

Tested product/plugin versions

Workaround tested with CloudBees CI on Modern cloud platforms 2.277.4.3
