PodresourceAPI reports about resources of pods in terminal phase #119423

@Tal-or

Description

What happened?

The PodResources API provides a List() endpoint that reports all the resources consumed by pods and containers on the node.

The problem is that pods in a terminal phase (i.e. with status Failed or Succeeded) are reported as well. The internal managers reassign resources that were assigned to pods in a terminal phase, so PodResources should ignore those pods, because the resources they are reported to hold may already be in use by other pods.

What did you expect to happen?

PodResources should ignore, and not report, resources attributed to pods in a terminal phase.
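As a rough illustration of the expected behavior, here is a minimal sketch that drops terminal pods from a simplified listing. The `PodEntry` type, its `phase` field, and `list_active` are hypothetical names for this example only; as far as I know the real List() response does not carry the pod phase, so an actual fix would have to look it up elsewhere (e.g. in the kubelet's own pod state).

```python
from dataclasses import dataclass, field

# Pod phases that Kubernetes treats as terminal.
TERMINAL_PHASES = {"Succeeded", "Failed"}


@dataclass
class PodEntry:
    """Simplified, hypothetical stand-in for one entry of a List() response."""
    namespace: str
    name: str
    phase: str  # assumption: phase is attached here only for illustration
    cpu_ids: list = field(default_factory=list)


def list_active(entries):
    """Return only the entries whose pod is not in a terminal phase."""
    return [e for e in entries if e.phase not in TERMINAL_PHASES]


entries = [
    PodEntry("default", "done-job", "Succeeded", [2, 3]),
    PodEntry("default", "web", "Running", [2, 3]),
    PodEntry("default", "crashed", "Failed", [4]),
]

active = list_active(entries)
print([e.name for e in active])
```

In this toy data the succeeded pod and the running pod both claim CPUs 2 and 3, which is the kind of stale accounting the filter is meant to hide from consumers.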

How can we reproduce it (as minimally and precisely as possible)?

I provided a new test case that demonstrates the exact problem and can be used as a reproducer: #119402

Anything else we need to know?

The docs that describe the PodResources API do not state explicitly whether List() should ignore the resources of terminal pods, so this might not be a bug at all.

OTOH, the internal managers reassign resources assigned to pods in a terminal phase, so it makes sense not to account for them.

Kubernetes version

Kubernetes v1.26.2

Cloud provider

OS version

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

Activity

added
kind/bug: Categorizes issue or PR as related to a bug.
on Jul 19, 2023
added
needs-sig: Indicates an issue or PR lacks a `sig/foo` label and requires one.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
on Jul 19, 2023

ffromani commented on Jul 19, 2023

@ffromani
Contributor

/sig node

added
sig/node: Categorizes an issue or PR as relevant to SIG Node.
and removed
needs-sig: Indicates an issue or PR lacks a `sig/foo` label and requires one.
on Jul 19, 2023

SergeyKanzhelev commented on Jul 19, 2023

@SergeyKanzhelev
Member

/triage accepted

added
triage/accepted: Indicates an issue or PR is ready to be actively worked on.
and removed
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
on Jul 19, 2023

carlory commented on Jul 21, 2023

@carlory
Member

/assign


ffromani commented on Jul 27, 2023

@ffromani
Contributor

I was thinking about this issue for a bit, and the root cause is probably that there is no (good enough) formal spec for the API itself, so the implementation is pretty much the reference.
We should probably improve here, but I'm not sure how (e.g. where to describe the API more formally; perhaps just in the docs?)

That said, my thoughts about the ideal semantics are:

  1. Considering the docs as the most accurate, possibly the only, spec: "The List endpoint provides information on resources of running pods, with details such as the id of exclusively allocated CPUs, device id as it was reported by device plugins and id of the NUMA node where these devices are allocated. Also, for NUMA-based machines, it contains the information about memory and hugepages reserved for a container." This alone should be sufficient to exclude pods in terminal phase from the running set.
  2. I need to double-check the actual behavior of the resource managers. If resources assigned to containers are NOT released until the containers are actually cleared by the system, then the API is returning the correct data, because these resources can't be re-allocated to newly admitted containers, and the accounting must reflect that.
  3. But this brings us to the question: is the allocation behavior of the resource managers documented? Should it be?

SergeyKanzhelev commented on Jul 28, 2023

@SergeyKanzhelev
Member

I wonder whether the pod resources API may report the same set of resources as used by two pods, one succeeded and one newly scheduled, in a single response?
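One way to check for that from the consumer side, as a minimal sketch: invert the listing into a CPU-to-pods map and keep only the CPUs claimed by more than one pod. The input shape (a dict from "namespace/name" to a list of exclusive CPU ids) and the name `find_shared_cpus` are assumptions made for this example, not the actual response type.

```python
from collections import defaultdict


def find_shared_cpus(pod_cpus):
    """Map each CPU id reported for more than one pod to the pods claiming it.

    pod_cpus: dict of "namespace/name" -> list of exclusively assigned CPU ids
    (a simplified, hypothetical view of a single List() response).
    """
    owners = defaultdict(list)
    for pod, cpus in pod_cpus.items():
        for cpu in cpus:
            owners[cpu].append(pod)
    return {cpu: pods for cpu, pods in owners.items() if len(pods) > 1}


# Toy response: a succeeded pod still listed next to a newly scheduled one.
response = {
    "default/done-job": [2, 3],
    "default/web": [3, 4],
}

shared = find_shared_cpus(response)
print(shared)
```

An empty result means no exclusive CPU is reported twice in that response; a non-empty one would confirm the scenario described above.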

@ffromani should we try to summarize shortcomings at the page you listed as a starting point? We can start with the list of edge cases like:

  • "You cannot rely on data completeness while kubelet is starting"
  • "Succeeded or Failed pods may be listed"
  • "When container has failed, can resources still be allocated?"
  • "Is pod information guaranteed visible when NRI plugins are being called for this pod?"
  • "What about init containers' resources? When will they be returned and when not?"

I like the idea that the podresources API is driven by whether resources are actually in use, not by the Pod status. Thus the resource manager behavior ideally needs to be documented.


Metadata

Assignees

No one assigned

    Labels

    kind/bug: Categorizes issue or PR as related to a bug.
    priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
    sig/node: Categorizes an issue or PR as relevant to SIG Node.
    triage/accepted: Indicates an issue or PR is ready to be actively worked on.

    Type

    No type

    Projects

    Status

    Triaged

    Milestone

    No milestone

    Relationships

    None yet

      Participants

      @SergeyKanzhelev, @swatisehgal, @k8s-ci-robot, @ffromani, @carlory

        PodresourceAPI reports about resources of pods in terminal phase · Issue #119423 · kubernetes/kubernetes