pod stuck with NodeAffinity status // using spot VMs under K8s 1.22.x and 1.23.x
#112333
Comments
@gillesdouaire: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label and provide further guidance. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/sig node
We are aware this is supposed to be fixed as of k8s 1.21, but we experienced it in the same context under a newer K8s version. All pods on a given node are stuck with NodeAffinity status, and will remain so until deleted, after which they are re-scheduled. The node is ready and otherwise healthy. k8s version: v1.22.12-gke.1200
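A minimal sketch of how to spot and clear such pods, assuming (as is the case for kubelet admission rejections) that they report `status.phase: Failed` with `status.reason: NodeAffinity`, and that `jq` is available:

```bash
# List Failed pods and their failure reason across all namespaces;
# pods rejected by the kubelet's node-affinity admission check show REASON=NodeAffinity.
kubectl get pods --all-namespaces --field-selector=status.phase=Failed \
  -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,REASON:.status.reason

# Delete only the NodeAffinity ones so their controllers (Deployment/DaemonSet/etc.) recreate them.
kubectl get pods --all-namespaces --field-selector=status.phase=Failed -o json \
  | jq -r '.items[] | select(.status.reason == "NodeAffinity") | "\(.metadata.namespace) \(.metadata.name)"' \
  | while read -r ns name; do kubectl delete pod -n "$ns" "$name"; done
```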
BTW, End of Life for 1.22 is 2022-10-28. How can we reproduce it? I cannot reproduce it by just restarting kubelet in my v1.24 cluster.
@pacoxu In my case, I was able to reproduce the situation on our 1.22 GKE cluster by issuing a few "kube nodes delete... --force" commands against our existing nodes hosted on spot VMs: each time waiting for a new node to respawn and stabilize, then deleting again. It took 6 delete commands before pods stuck in the NodeAffinity state appeared. As mentioned before, all the pods stuck in NodeAffinity are assigned to the same node. Right now, I am leaving a few pods in that state, so if you need more details on the actual status of the workloads, let me know.
@pacoxu Using the "kube nodes delete... --force" approach, I was able to reproduce on a Kubernetes cluster running 1.23.10. Same behaviour: all the pods stuck in NodeAffinity are assigned to the same node and remain unready.
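For reference, the force-delete reproduction described above amounts to roughly the loop below. This is only a sketch: the GKE spot node label, the six iterations, and the five-minute wait are assumptions taken from the comments, not exact commands.

```bash
# Repeatedly force-delete a spot node, let GKE respawn a replacement, then check for stuck pods.
for i in $(seq 1 6); do
  node=$(kubectl get nodes -l cloud.google.com/gke-spot=true -o jsonpath='{.items[0].metadata.name}')
  kubectl delete node "$node" --force
  sleep 300   # wait for the replacement node to register and stabilize
  kubectl get pods --all-namespaces --field-selector=status.phase=Failed
done
```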
Does this mean that you delete a node and restart the kubelet several times?
Can you share a sample pod yaml on that node with NodeAffinity state?
Only the first step; once the node is force deleted, Kubernetes will spawn a new node, and then pods will either be reassigned correctly or fall into the NodeAffinity state. The pods I had left in the NodeAffinity state have been flushed (the spot VMs were restarted), so I will need to re-generate a case; I will post the data from a pod yaml here as soon as I have it.
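For readers following along, a minimal illustrative sketch (not the actual pod from this cluster, with hypothetical names) of what a pod rejected by the kubelet's NodeAffinity admission check typically reports; the exact `message` wording varies by kubelet version:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-workload-7d9f8b6c5-abcde   # hypothetical pod name
  namespace: default
spec:
  nodeName: gke-example-spot-pool-1234      # the (old) node the scheduler had bound the pod to
  # ... containers, volumes, affinity omitted ...
status:
  phase: Failed
  reason: NodeAffinity
  message: Pod was rejected because the node's labels no longer satisfy the pod's node affinity/selector.
  startTime: "2022-09-07T12:00:00Z"
```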
Good and/or bad news: I've seen the NodeAffinity status occur once under K8s 1.23, but now I have trouble reproducing it.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
+1, after adding a new preemptible (not spot) node pool to a 1.23.13-gke.900 cluster and scheduling a deployment there, I've also noticed this behaviour on the first couple of preemptions. /remove-lifecycle stale
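For context, a preemptible node pool like the one mentioned above can be created with something along these lines (a sketch; the cluster, pool, and zone names are placeholders, and `--spot` is the analogous flag for spot VMs):

```bash
# Create a preemptible node pool on an existing GKE cluster (use --spot instead for spot VMs).
gcloud container node-pools create preemptible-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --preemptible \
  --num-nodes 3
```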
@jonpulsifer Do you mind giving some more evidence on how to reproduce the issue?
This is stalled because I still have no reliable steps to reproduce it.
+1 1.23.14-gke.401 / 1.23.12-gke.100
@SimSimY do you have some more details?
This happens with 1.25.8-gke.500 as well. Steps to reproduce:
I can confirm this also happens in our clusters running 1.25.8-gke.1000. Seems to happen when spot VMs are pre-empted, but only occasionally. Here's sample events from
Same experience on …. This was also happening on …. Google says this is 'fixed' from …. Sliced screenshot output of the equivalent of … :
Still able to reproduce on control plane 1.25.12-gke.500 / preemptible node pool 1.25.12-gke.500. NodeAffinity status pods look like:
Are you seeing that all the pods stuck in the NodeAffinity status have …? GKE has a fix for this issue to automatically clean up terminal pods on VM preemption, available from control plane version 1.27.2-gke.1800+. Please try that out as the long-term fix. Since this is a GKE-specific issue, please reach out to GKE support if you continue to have issues. Thanks!
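Until a cluster is on a control plane with that fix, a periodic cleanup along these lines is one possible stopgap (a sketch; note it removes all Failed pods, not only the NodeAffinity ones, so narrow it if that is too broad):

```bash
# Delete terminal (Failed) pods cluster-wide; owning controllers recreate replacements as needed.
kubectl delete pods --all-namespaces --field-selector=status.phase=Failed
```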
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned". In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The same problem on 1.22.3-gke.700.
Originally posted by @maxpain in #98534 (comment)