PodresourceAPI reports about resources of pods in terminal phase #119423

Open
Tal-or opened this issue Jul 19, 2023 · 10 comments · May be fixed by #119553
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@Tal-or
Contributor

Tal-or commented Jul 19, 2023

What happened?

PodResourcesAPI provides a List() endpoint which reports all the resources consumed by pods and containers on the node.

The problem is that pods in a terminal phase (i.e. in Failed or Succeeded status) are reported as well. The internal managers reassign resources that were assigned to pods in a terminal phase, so PodResources should ignore those pods rather than report their resources as still in use.
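For context, here is a minimal client sketch that calls List() and prints which pods the kubelet reports. It assumes the conventional /var/lib/kubelet/pod-resources/kubelet.sock socket path and the k8s.io/kubelet/pkg/apis/podresources/v1 bindings; adjust for your environment.

```go
// Minimal sketch of a PodResources List() client; error handling trimmed.
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	// The socket path below is the conventional default; adjust if your
	// deployment places the pod-resources socket elsewhere.
	conn, err := grpc.Dial("unix:///var/lib/kubelet/pod-resources/kubelet.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := podresourcesv1.NewPodResourcesListerClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	resp, err := client.List(ctx, &podresourcesv1.ListPodResourcesRequest{})
	if err != nil {
		panic(err)
	}
	// As described in this issue, the response may also include pods that
	// are already in a terminal phase (Succeeded or Failed).
	for _, pr := range resp.GetPodResources() {
		fmt.Printf("reported pod: %s/%s\n", pr.GetNamespace(), pr.GetName())
	}
}
```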

What did you expect to happen?

PodResources should ignore, and not report, resources used by pods that are in a terminal phase.

How can we reproduce it (as minimally and precisely as possible)?

#119402 provides a new test case that demonstrates the exact problem and can be used as a reproducer.

Anything else we need to know?

The docs that describe the PodResources API do not state explicitly whether List() should ignore the resources of terminal pods, so this might not be a bug at all.

On the other hand, the internal managers reassign resources assigned to pods in a terminal phase, so it makes sense not to account for them.

Kubernetes version

Kubernetes v1.26.2

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@Tal-or Tal-or added the kind/bug Categorizes issue or PR as related to a bug. label Jul 19, 2023
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 19, 2023
@ffromani
Contributor

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 19, 2023
@SergeyKanzhelev SergeyKanzhelev added this to Triage in SIG Node Bugs Jul 19, 2023
@SergeyKanzhelev SergeyKanzhelev moved this from Triage to Triaged in SIG Node Bugs Jul 19, 2023
@SergeyKanzhelev
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 19, 2023
@carlory
Member

carlory commented Jul 21, 2023

/assign

@ffromani
Contributor

ffromani commented Jul 27, 2023

I was thinking about this issue for a bit, and the root cause is probably that there is no (good enough) formal spec for the API itself, so the implementation is pretty much the reference.
We should probably improve here, but I'm not sure how (e.g. where should the API be described more formally? Perhaps just in the docs?)

That said, my thoughts about the ideal semantics are:

  1. Considering the docs as the most accurate, and possibly the only, spec: "The List endpoint provides information on resources of running pods, with details such as the id of exclusively allocated CPUs, device id as it was reported by device plugins and id of the NUMA node where these devices are allocated. Also, for NUMA-based machines, it contains the information about memory and hugepages reserved for a container." This alone should be sufficient to exclude pods in a terminal phase from the running set (see the sketch after this list).
  2. I need to double-check the actual behavior of the resource managers. If resources assigned to containers are NOT released until the containers are actually cleared by the system, then the API is returning the correct data, because these resources can't be re-allocated to newly admitted containers and the accounting must reflect that.
  3. But this brings us to the question: is the allocation behavior of the resource managers documented? Should it be?
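As referenced in point 1, here is a minimal sketch of the phase check under discussion; the package and helper names are assumptions for illustration, not the actual kubelet code.

```go
// Sketch of the "terminal phase" check implied by the docs quoted above.
package podphase

import v1 "k8s.io/api/core/v1"

// PodIsTerminal reports whether a pod has reached a terminal phase, i.e. it
// will never run again and its exclusively assigned resources are candidates
// for reassignment by the resource managers.
func PodIsTerminal(pod *v1.Pod) bool {
	return pod.Status.Phase == v1.PodSucceeded || pod.Status.Phase == v1.PodFailed
}
```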

@SergeyKanzhelev
Member

I wonder if the pod resources API may report the same set of resources as used by two pods (one succeeded and one newly scheduled) in a single response?

@ffromani should we try to summarize shortcomings at the page you listed as a starting point? We can start with the list of edge cases like:

  • "You cannot rely on data completeness while kubelet is starting"
  • "Succeeded or Failed pods may be listed"
  • "When a container has failed, can resources still be allocated?"
  • "Is pod information guaranteed visible when NRI plugins are being called for this pod?"
  • "What about init container resources? When will they be returned and when not?"

I like the idea that the podresources API is driven by whether resources are actually in use, not by the Pod status. Thus the resource manager behavior ideally needs to be documented.

@ffromani
Contributor

@SergeyKanzhelev I totally agree that the behavior should be documented in a more precise way. I'm just not sure that page is the best place for reference docs. I'll check with SIG-docs and get advice.

@ffromani
Contributor

ffromani commented Jul 28, 2023

I wonder if the pod resources API may report the same set of resources as used by two pods (one succeeded and one newly scheduled) in a single response?

Off the top of my head, I'm not sure we have a code path which prevents this. It should be extremely unlikely (at the very least you would need a very limited amount of resources, bypass the scheduler entirely and hit a race within the kubelet) but possible. It should also be transient: the next query should return the expected, consistent result. But still, I'm not sure there's prevention for this flow. It also depends on the specifics of the resource managers.

The gist is that the podresources API is a thin layer which reformats and exposes the internal data of the resource managers. This makes the code very simple, but it leaks the internal implementation of the resource managers' allocation process pretty badly.

This in turn makes the API not really formally specified, with perhaps poor stability guarantees.

@ffromani
Contributor

ffromani commented Aug 1, 2023

I think the best approach would be extending https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/kubelet/pkg/apis/podresources/v1/api.proto#L36 to enable the client side to choose whether to filter out pods in a terminal phase. This way we fix the issue AND we preserve backward compatibility.

This of course assumes we can extend the message in a backward-compatible way. It should be the case, but we need to double-check.

@swatisehgal
Contributor

swatisehgal commented Aug 1, 2023

+1 I like the ability to let the client filter the result based on pod phase, or in general exposing the pod phase so that clients can filter out the pods themselves.
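Until the API exposes the pod phase or a filter, a client could approximate the same idea by cross-referencing the List() response with pod phases from the API server. A minimal sketch, assuming client-go access; the package and function names are hypothetical and not part of any existing package:

```go
// Illustrative client-side filtering sketch, not part of the current API:
// drop List() entries whose pod is already in a terminal phase, using the
// API server as the source of truth for pod phase.
package podresfilter

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

// FilterTerminal returns only the entries whose pod is not Succeeded or Failed.
func FilterTerminal(ctx context.Context, cs kubernetes.Interface,
	items []*podresourcesv1.PodResources) ([]*podresourcesv1.PodResources, error) {
	var out []*podresourcesv1.PodResources
	for _, pr := range items {
		pod, err := cs.CoreV1().Pods(pr.GetNamespace()).Get(ctx, pr.GetName(), metav1.GetOptions{})
		if err != nil {
			return nil, err
		}
		if pod.Status.Phase == v1.PodSucceeded || pod.Status.Phase == v1.PodFailed {
			continue // terminal pod: its resources may already be reassigned
		}
		out = append(out, pr)
	}
	return out, nil
}
```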

@ffromani
Contributor

ffromani commented Aug 1, 2023

The next question would be whether such a change deserves a KEP or not. I quite strongly feel that this would still fall under the "bugfix" category: we have a rich history of "invasive" and sometimes behavior-changing fixes, and this change hardly looks like an RFE.
But before moving forward with that, we need a better spec of the API behavior and more formalized guarantees. That would be the next step.

ffromani added a commit to ffromani/scheduler-plugins that referenced this issue Oct 10, 2023
Add an option to create and use a separate informer
to get unfiltered pod listing from the apiserver.

Due to mismatched view with respect to the kubelet,
the plugin needs access to pod in terminal phase
which are not deleted to make sure the reconciliation
is done correctly.

xref: kubernetes-sigs#598
xref: kubernetes/kubernetes#119423

Signed-off-by: Francesco Romani <fromani@redhat.com>
ffromani added a commit to ffromani/scheduler-plugins that referenced this issue Nov 30, 2023
ffromani added a commit to ffromani/scheduler-plugins that referenced this issue Dec 5, 2023
ffromani added a commit to ffromani/scheduler-plugins that referenced this issue Jan 17, 2024
ffromani added a commit to openshift-kni/scheduler-plugins that referenced this issue Jan 19, 2024
(cherry picked from commit e466bc4)
ffromani added a commit to openshift-kni/scheduler-plugins that referenced this issue Jan 23, 2024
(cherry picked from commit e466bc4)
salmanyam pushed a commit to mvle/scheduler-plugins that referenced this issue Mar 5, 2024