
Some thoughts about the pod/container abstraction in kubelet #12619

Closed
yujuhong opened this issue Aug 12, 2015 · 16 comments
Labels
priority/backlog Higher priority than priority/awaiting-more-evidence. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@yujuhong
Contributor

This is purely food for thought, as others may not agree and this probably wouldn't be prioritized anytime soon :-)

Kubelet defines Pod in the container runtime, which has a list of containers and an api.PodStatus object that includes all the container statuses. This seems counter-intuitive to me because Pod is supposed to be a representation of the running pod in the system, i.e., it is essentially the pod status/state (so we have status nested in the status?). It is also unnecessary for the container runtime to speak api.PodStatus, because kubelet has to fill in extra information later on to complete it.

The abstraction is a bit hard to digest on its own, and the types seem to center around implementation details (retrofitting?). Moreover, we often pass Pods around without any valid pod/container status; you have no way to tell after the fact whether a container in the pod is running or dead. I'd prefer a flatter struct that reflects the status of the containers/pods, and to let kubelet interpret and translate that status into api.PodStatus.

A simple example would be:

type Container struct {
    ID    types.UID
    Name  string
    Image string
    Hash  uint64
    State ContainerState
    Ready bool
    ...
}

type Pod struct {
    ID         types.UID
    Name       string
    Namespace  string
    Containers []*Container
    PodIP      string
}
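
As an illustration of the second point above (kubelet interpreting the flat status itself), here is a minimal sketch; the IsContainerRunning helper and the ContainerState values are hypothetical, not existing kubelet code:

type ContainerState string

const (
    ContainerStateRunning ContainerState = "running"
    ContainerStateExited  ContainerState = "exited"
)

// IsContainerRunning reports whether the named container in this pod is
// currently running, something the current runtime.Pod cannot answer
// without a populated api.PodStatus.
func (p *Pod) IsContainerRunning(name string) bool {
    for _, c := range p.Containers {
        if c.Name == name {
            return c.State == ContainerStateRunning
        }
    }
    return false
}

Kubelet itself would then translate these states into api.PodStatus when reporting to the apiserver.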

/cc @yifan-gu

@yujuhong yujuhong added priority/backlog Higher priority than priority/awaiting-more-evidence. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Aug 12, 2015
@yifan-gu
Contributor

Interesting, I was also thinking about removing the pod status from it when viewing your pod lifecycle prototype patch :)
Yes, the runtime.Pod we have now can definitely be improved. I think the reason it has the pod status is because it was "inspired by" api.Pod (which has a pod status).

@yifan-gu
Contributor

FWIW I am working on some other projects this week. But should be able to get back on rkt patches and kubelet refactoring quickly.

@dchen1107
Member

@yujuhong You are very right on this. That ugly abstraction was introduced due to the legacy interface and implementation of rkt at the time. @yifan-gu maybe now is the time to clean this up?

@yujuhong
Contributor Author

Interesting, I was also thinking about removing the pod status from it when viewing your pod lifecycle prototype patch :)

Good to know that we're on the same page!

FWIW I am working on some other projects this week. But should be able to get back on rkt patches and kubelet refactoring quickly.

Sounds good. I will work on converting my proposal to a more concise github PR with sub-issues, and see if you are interested in taking on any of those. We should touch base sometime next week.

I think we need to clean this up before adding a generic runtime pod cache from my proposal. It'd certainly make things a lot easier :)

@yifan-gu
Contributor

@yujuhong Thank you! More than happy to help as I am back working on rkt fixes now, and would like to take the chance to refactor and clean up those historical problems :)

@yifan-gu
Contributor

Going to refactor on this if you haven't already started :)

@yujuhong
Contributor Author

After thinking about it more, I propose that we add a new type RawPodStatus in the container runtime. RawPodStatus represents the current state of the pod in the container runtime, without requiring knowledge of the original pod spec.

At kubelet startup, kubelet can ask the container runtime for the RawPodStatus of all pods running on the node and even populate the cache, without waiting to receive all the api.Pod objects from the sources. This would simplify the bootstrapping sequence.

What RawPodStatus will look like:

type RawPodStatus struct {
        ID                types.UID
        Name              string
        Namespace         string
        IP                string
        ContainerStatuses []RawContainerStatus
}

type RawContainerStatus struct {
        ID           ContainerID
        Name         string
        Status       ContainerStatus
        CreatedAt    time.Time
        StartedAt    time.Time
        FinishedAt   time.Time
        ExitCode     int
        Image        string
        ImageID      string
        RestartCount int
        Reason       string
        Message      string
}
  • Unlike api.PodStatus, RawPodStatus lists a RawContainerStatus for every container, running or not, so that other kubelet components can easily handle them.
  • One can easily derive api.PodStatus from RawPodStatus for pod status updates (see the sketch below).
  • Unlike Pod and Container, RawPodStatus includes the complete information about the pod and its containers on the node. Hence we can simplify the code in various places:
    • E.g., in SyncPod(), instead of passing a Pod and an api.PodStatus, we only need to pass RawPodStatus.
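
A rough sketch of the derivation mentioned above (the api.PodStatus/api.ContainerStatus field names are taken from the current api package and may need adjusting; translating the container state is elided):

func toAPIPodStatus(raw *RawPodStatus) *api.PodStatus {
        status := &api.PodStatus{PodIP: raw.IP}
        for _, c := range raw.ContainerStatuses {
                status.ContainerStatuses = append(status.ContainerStatuses, api.ContainerStatus{
                        Name:         c.Name,
                        Image:        c.Image,
                        ImageID:      c.ImageID,
                        RestartCount: c.RestartCount,
                        // Translating c.Status/Reason/Message into api.ContainerState,
                        // and filling in Phase/Conditions from the pod spec, is left
                        // to kubelet itself.
                })
        }
        return status
}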

The plan is to populate the RawPodStatus cache by periodic relisting (and/or container event stream), and let the pod workers consume the cache content to avoid hitting docker/rkt directly.
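
For concreteness, a minimal sketch of what such a cache might look like (podStatusCache is a placeholder name, not an existing kubelet type; assumes the sync and types packages are imported). PLEG would call Set() on every relist or runtime event, and pod workers would call Get() instead of inspecting docker/rkt directly:

type podStatusCache struct {
        lock     sync.RWMutex
        statuses map[types.UID]*RawPodStatus
}

func newPodStatusCache() *podStatusCache {
        return &podStatusCache{statuses: make(map[types.UID]*RawPodStatus)}
}

// Set records the latest RawPodStatus observed for a pod.
func (c *podStatusCache) Set(uid types.UID, status *RawPodStatus) {
        c.lock.Lock()
        defer c.lock.Unlock()
        c.statuses[uid] = status
}

// Get returns the last observed RawPodStatus for a pod, if any.
func (c *podStatusCache) Get(uid types.UID) (*RawPodStatus, bool) {
        c.lock.RLock()
        defer c.lock.RUnlock()
        status, ok := c.statuses[uid]
        return status, ok
}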

@yifan-gu, thoughts?

@yujuhong
Contributor Author

Also cc @kubernetes/goog-node

The specific problem is that

  • SyncPod relies on api.PodStatus as its input (and source of container statuses)
  • Generating api.PodStatus requires api.Pod (a.k.a, the pod spec)

To populate the cache for SyncPod to consume, PLEG would need to generate api.PodStatus, i.e., it needs access to the pod specs (api.Pod). We could pass the pod manager to the PLEG, but that adds an extra dependency. It also means that when kubelet restarts, PLEG would wait indefinitely for the pod specs to arrive before it could populate the cache. I don't like the ugly dependency, and adding RawPodStatus eliminates the need for it. This does require more refactoring, though.

@yifan-gu
Contributor

Sounds reasonable. Another thing I take from this is that relisting pod statuses seems to require one List plus n Inspect calls, so for the rkt api-service I might need to add an option (e.g. detail=true) to the request, so that ListPods() can return as much information as possible and avoid a loop of InspectPod() calls.

@yujuhong
Contributor Author

Hmm...you are right. For rkt, listing can actually provide more information, but I am not sure if we can optimize that in a generic, runtime-agnostic implementation...

When relisting,

  1. GetPods gets all the high-level pod information.
  2. GetRawPodStatus gets the detailed information for each pod that looks different from the last time.

This is a compromise between docker and rkt. For docker, GetRawPodStatus would list and inspect multiple containers, while in fact docker only needs to inspect one container. This might become worse when we use the container event stream, as we may call GetRawPodStatus for each container in the pod. As for rkt, GetRawPodStatus is actually redundant since GetPods could potentially report all the information.

Unless we are going to implement caches in individual container runtimes, compromises are unavoidable :-(
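
To make the two-step relisting concrete, here is a rough sketch (the GetPods/GetRawPodStatus signatures and the change detection are assumptions, not the final interface; the cache is the one sketched earlier):

func (p *pleg) relist() {
        // Step 1: one cheap listing call for the high-level pod information.
        // true = include non-running pods (assumed semantics).
        pods, err := p.runtime.GetPods(true)
        if err != nil {
                return
        }
        for _, pod := range pods {
                // Step 2: only inspect pods that look different from the last relist.
                if !p.changedSinceLastRelist(pod) {
                        continue
                }
                status, err := p.runtime.GetRawPodStatus(pod.ID, pod.Name, pod.Namespace)
                if err != nil {
                        continue // hypothetical: log and retry on the next relist
                }
                p.cache.Set(pod.ID, status)
        }
}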

@yifan-gu
Contributor

@yujuhong Sorry for the confusion; I actually meant the InspectPod() and ListPods() of rkt's own API. Kubelet's ListPods() can invoke rkt.ListPods(simple), and kubelet's GetRawPodStatus() can invoke rkt.InspectPod(). If we want to get all raw pod statuses, we can do rkt.ListPods(detailed).

@yujuhong
Contributor Author

Sorry for the confusion; I actually meant the InspectPod() and ListPods() of rkt's own API. Kubelet's ListPods() can invoke rkt.ListPods(simple), and kubelet's GetRawPodStatus() can invoke rkt.InspectPod(). If we want to get all raw pod statuses, we can do rkt.ListPods(detailed).

Understood. Thanks for the explanation.

Adding the raw pod status would involve some significant refactoring. I have a first batch of WIP commits to add this to the runtime interface, along with the implementation in dockertools: https://github.com/yujuhong/kubernetes/tree/raw_status. If we decide to do it, the TODOs will be:

  • Add the implementation in rkt too
  • Change kubelet code to use RawPodStatus for syncing
  • Add the pod cache for RawPodStatus

@Random-Liu
Member

Hmm... My understanding is that currently we run generatePodStatus() before and after SyncPod(). However, only the "runtime-related status" is needed during SyncPod(). Fields like Phase and Conditions not only rely on the pod spec but are also never used in SyncPod().

So in fact there should be two kinds of "PodStatus": one shows the up-to-date runtime status (needed in SyncPod() and other kubelet components); the other is generated right before reporting the pod status to the apiserver. The former is the so-called "RawPodStatus".

My benchmark results show that it is really inefficient to let each pod worker access the runtime independently. So if we have RawPodStatus, the PLEG can populate a new "runtime cache" which maintains the up-to-date RawPodStatus by periodically relisting (or by leveraging the runtime event stream), so that pod workers stop accessing the runtime independently.

Do I understand the story right? :)

@yujuhong
Contributor Author

@Random-Liu, your summary is correct. api.PodStatus includes high-level information that is irrelevant to the container runtime; that kind of information can be handled by kubelet itself. SyncPod would only need RawPodStatus (the actual state) AND api.Pod (the desired state) to do its job :)

@Random-Liu
Member

@yujuhong Ha, thanks for your explanation. :)

@yujuhong
Contributor Author

yujuhong commented Mar 6, 2016

We've finished all the TODO items on this issue. Closing it.
