Controller/Operations center pods stuck in terminating state

Issue

When a worker node crashes, controller/Operations center pods are not moved to a different worker node.
Instead, the pods get stuck in the Terminating or Unknown state.

 kubectl get pods -n $NAMESPACE -o wide | grep -i Terminating
ControlerTest1-0 1/1 Terminating 0 7d11h 10.42.17.34 somenode1 <none> <none>
ControlerTest2-0 1/1 Terminating 0 7d11h 10.42.17.35 somenode1 <none> <none>
ControlerTest3-0 1/1 Terminating 0 7d10h 10.42.17.36 somenode1 <none> <none>

Environment

Resolution

This happens when a Kubernetes worker node loses connectivity to the API server. Kubernetes (versions 1.5 or newer) will not delete Pods just because a Node is unreachable.
Pods running on an unreachable Node enter the ‘Terminating’ or ‘Unknown’ state after a timeout. Pods may also enter these states when the user attempts graceful deletion of a Pod on an unreachable Node.
In either case the pod's record remains in the API server, so no replacement pod is scheduled: a StatefulSet requires each pod to keep a unique identity within the cluster, and it will not create a second pod with the same identity while the first still exists.
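To see which pods are affected and which node they were running on, a helper along the following lines may be useful. It is only a sketch: the function name stuck_pods is hypothetical, and it assumes that a pod shown as Terminating is one whose metadata.deletionTimestamp is set, which it reads through kubectl custom columns rather than by parsing the STATUS column.

```shell
#!/usr/bin/env bash
# Sketch: list pods that are marked for deletion but still present in the
# API server, together with the node they were scheduled on.
# "stuck_pods" is a hypothetical helper name, not part of kubectl.
stuck_pods() {
  local namespace="$1"
  # custom-columns prints "<none>" when a field is not set, so any other
  # value in the third column means the pod has a deletion timestamp.
  kubectl get pods -n "$namespace" --no-headers \
    -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,DELETED:.metadata.deletionTimestamp \
    | awk '$3 != "<none>" { print $1, $2 }'
}
```

For example, stuck_pods "$NAMESPACE" prints one "pod node" pair per line; running kubectl get nodes afterwards shows whether the listed node is NotReady.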

Workaround

A pod in the Terminating state is removed from the API server only when one of the following happens. Once its record is gone, the pod can be rescheduled.

  • The Node object is deleted (either by you, or by the Node Controller).
  • The kubelet on the unresponsive Node starts responding, kills the Pod and removes the entry from the apiserver.
  • Force deletion of the Pod by the user (kubectl delete pods <pod> --grace-period=0 --force). This should be used only as a last resort.
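The last option can be wrapped in a small guard so that a pod is force deleted only when its node is not reporting Ready. This is a minimal sketch, not a supported tool: the function name force_delete_pod is hypothetical, and the check assumes that the node's Ready condition is the right signal for "unreachable" in your cluster.

```shell
#!/usr/bin/env bash
# Sketch: force delete a pod only if its node is not Ready.
# "force_delete_pod" is a hypothetical helper name, not part of kubectl.
force_delete_pod() {
  local namespace="$1" pod="$2"
  local node ready
  # Node the pod was scheduled on.
  node=$(kubectl get pod "$pod" -n "$namespace" -o jsonpath='{.spec.nodeName}')
  # Status of the node's Ready condition: True, False or Unknown.
  # The value is empty if the Node object no longer exists.
  ready=$(kubectl get node "$node" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
  if [ "$ready" = "True" ]; then
    echo "Node $node is Ready; refusing to force delete $pod" >&2
    return 1
  fi
  kubectl delete pods "$pod" -n "$namespace" --grace-period=0 --force
}
```

It would be called as force_delete_pod "$NAMESPACE" <pod>. Refusing when the node is Ready is a deliberate choice: force deleting a StatefulSet pod that may still be running risks two pods with the same identity, so a healthy node is better handled by a normal delete.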

References

https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/
