Issue
When a worker node crashes, controller/Operations Center pods are not moved to a different worker node.
Controller/Operations Center pods get stuck in a Terminating/Unknown state.
kubectl get pods -n $NAMESPACE -o wide | grep -i Terminating
ControlerTest1-0 1/1 Terminating 0 7d11h 10.42.17.34 somenode1 <none> <none>
ControlerTest2-0 1/1 Terminating 0 7d11h 10.42.17.35 somenode1 <none> <none>
ControlerTest3-0 1/1 Terminating 0 7d10h 10.42.17.36 somenode1 <none> <none>
Environment
- CloudBees CI (CloudBees Core)
- CloudBees CI (CloudBees Core) on modern cloud platforms - Managed Master
- CloudBees CI (CloudBees Core) on modern cloud platforms - Operations Center
Resolution
This happens when a Kubernetes worker node loses connectivity to the API server. Kubernetes (version 1.5 or newer) does not delete Pods just because a Node is unreachable.
The Pods running on an unreachable Node enter the ‘Terminating’ or ‘Unknown’ state after a timeout. Pods may also enter these states when the user attempts graceful deletion of a Pod on an unreachable Node.
In this case the Pod's record remains in the API server, so a replacement Pod is not scheduled: a StatefulSet requires each Pod to maintain a unique identity within the cluster.
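To confirm this scenario, check the Node that the stuck Pods are scheduled on. The commands below are a minimal sketch; somenode1, ControlerTest1-0 and $NAMESPACE are placeholders taken from the example output above, so substitute the names from your own environment.

# Check whether the Node hosting the stuck Pods reports NotReady/Unknown
kubectl get nodes -o wide

# Inspect the Node and one of the stuck Pods; the events typically show the kubelet is unreachable
kubectl describe node somenode1
kubectl describe pod ControlerTest1-0 -n $NAMESPACE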
Workaround
A Pod stuck in the Terminating state is removed by any of the actions below (example commands follow the list). Once its record is removed from the API server, the Pod can be rescheduled on another Node.
- The Node object is deleted (either by you, or by the Node Controller).
- The kubelet on the unresponsive Node starts responding again, kills the Pod, and removes the entry from the API server.
- Force deletion of the Pod by the user (kubectl delete pods <pod-name> --grace-period=0 --force). This should be used as a last resort.
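For example, assuming the unreachable Node is somenode1 and the stuck Pod is ControlerTest1-0 (as in the output above), the commands would look like the following. These names and the $NAMESPACE variable are placeholders; adjust them for your environment.

# Option 1: delete the unreachable Node object so its Pods are garbage collected
kubectl delete node somenode1

# Last resort: force-delete the stuck Pod so the StatefulSet controller can recreate it
kubectl delete pod ControlerTest1-0 -n $NAMESPACE --grace-period=0 --force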
References
https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/