
Kubelet: Pod level Resource Management #26751

Merged
merged 1 commit into from Jul 25, 2016

Conversation


@dubstack dubstack commented Jun 2, 2016

This proposal outlines our plan for improving resource management in Kubernetes by having a Cgroup hierarchy with QoS and Pod level Cgroups.

This is the initial proposal, which broadly covers our goals and how we plan to achieve them. At this point we would really appreciate feedback from the community.

This is tied to the upstream issue #5671, so I would request
@vishh @dchen1107 @bgrant0607 @jdef to PTAL.


@k8s-bot

k8s-bot commented Jun 2, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

2 similar comments

@k8s-github-robot k8s-github-robot added kind/design, size/L, and release-note-label-needed labels Jun 2, 2016
@vishh
Contributor

vishh commented Jun 2, 2016

ok to test

@vishh
Contributor

vishh commented Jun 2, 2016

cc @kubernetes/sig-node


We have the following resource parameters for the `Bu` cgroup.

'/Bu/cpu.shares = summation of all cpu requests
Member

nit: ' typo

Member

@vishh @dchen1107 - when we discussed this in our F2F, we had also discussed making Bu a child of G? Any reason for the change, or am I recalling incorrectly?

Member

NM, I remember we discussed both flat and hierarchical proposals, and this is presenting the flat.

Contributor

We decided to go with a flat hierarchy at the end.
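
To make the flat layout above concrete, here is a minimal Go sketch (not part of the patch) of how /Bu/cpu.shares could be derived from the summed CPU requests of Burstable pods. The 1024-shares-per-CPU mapping and the floor of 2 follow the usual kubelet/kernel convention; the pod request values are made up.

```go
// Illustrative sketch, not kubelet code: derive the Burstable ("Bu") cgroup's
// cpu.shares from the summed CPU requests of all Burstable pods.
package main

import "fmt"

// milliCPUToShares follows the conventional mapping of 1000 millicores to
// 1024 cpu.shares, with 2 as the kernel-enforced minimum.
func milliCPUToShares(milliCPU int64) int64 {
	shares := milliCPU * 1024 / 1000
	if shares < 2 {
		return 2
	}
	return shares
}

func main() {
	// Hypothetical Burstable pods with CPU requests of 250m, 500m and 100m.
	requests := []int64{250, 500, 100}

	var sum int64
	for _, r := range requests {
		sum += r
	}

	// Under the flat hierarchy, /Bu sits directly under the root cgroup and
	// its cpu.shares is the summation of all Burstable pod CPU requests.
	fmt.Printf("/Bu/cpu.shares = %d\n", milliCPUToShares(sum)) // 870 for 850m
}
```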

@derekwaynecarr
Member

It would be good if this proposal could give an outline of the entire system cgroup hierarchy.

For example, it should include something more akin to:
https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/kubelet-systemd.md#proposed-cgroup-hierarchy

I think we also need a description of how we would, at minimum, enforce CPU shares for system-reserved and kube-reserved. Otherwise, they will get the default shares value of 1024, which may or may not reflect the desired outcome.
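
As a rough illustration of that point (not from the proposal), the sketch below pins explicit cpu.shares on hypothetical top-level system-reserved and kube-reserved cgroups so they do not silently fall back to the kernel default of 1024 shares; the cgroupfs paths and reservation sizes are made up.

```go
// Illustrative sketch: write explicit cpu.shares for reserved cgroups.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// setCPUShares converts a reserved CPU amount in millicores to cpu.shares
// (1000m -> 1024 shares, minimum 2) and writes it into the given cgroup.
func setCPUShares(cgroupPath string, milliCPU int64) error {
	shares := milliCPU * 1024 / 1000
	if shares < 2 {
		shares = 2
	}
	file := filepath.Join(cgroupPath, "cpu.shares")
	return os.WriteFile(file, []byte(fmt.Sprintf("%d", shares)), 0644)
}

func main() {
	// e.g. reserving cpu=500m for the system and cpu=250m for kube daemons
	// would map to 512 and 256 shares instead of the implicit 1024 each.
	_ = setCPUShares("/sys/fs/cgroup/cpu/system-reserved", 500)
	_ = setCPUShares("/sys/fs/cgroup/cpu/kube-reserved", 250)
}
```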


/BE/cpu.shares = 2
/BE/cpu.quota = not set
/BE/m.limit_in_bytes = Allocatable - summation of memory requests/limits of guaranteed pods
Member

Allocatable - summation of memory requests for G and Bu pods

Member

So this will have an interesting side effect that we should clarify.

The kubelet is more likely to OOM-kill existing BE pods as more G and Bu pods are scheduled, even if those G and Bu pods are using less than their requests, since we will have to dynamically reduce the size of BE m.limit_in_bytes.

The existing eviction behavior will not cause those BE pods to get moved since they will not be able to hog memory.

Contributor

Yeah. As G pods are added, BE pods are more susceptible to being OOM Killed. This policy attempts to provide best performance for G pods. In the future, we can let users configure the level of over-commit.
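
To see the side effect numerically, here is a small Go sketch (illustrative only, with made-up numbers) of how the BE cgroup's memory limit shrinks as G and Bu pods are admitted, which is what pushes already-running BE pods toward an OOM kill.

```go
// Illustrative sketch: the BE memory limit as a function of admitted pods.
package main

import "fmt"

// beMemoryLimit returns Allocatable minus the summed memory requests of
// Guaranteed and Burstable pods, i.e. the memory left to the BE cgroup.
func beMemoryLimit(allocatable int64, gAndBuRequests []int64) int64 {
	limit := allocatable
	for _, r := range gAndBuRequests {
		limit -= r
	}
	return limit
}

func main() {
	const gib = int64(1) << 30
	allocatable := 8 * gib

	// With one Guaranteed pod requesting 2Gi, BE pods may use up to 6Gi.
	fmt.Println(beMemoryLimit(allocatable, []int64{2 * gib}) / gib) // 6

	// After two more G/Bu pods (2Gi and 1Gi requests) are scheduled, the
	// kubelet has to lower /BE/m.limit_in_bytes to 3Gi, so a BE pod already
	// using more than 3Gi becomes an OOM-kill candidate even though the new
	// pods may not yet be using their requests.
	fmt.Println(beMemoryLimit(allocatable, []int64{2 * gib, 2 * gib, 1 * gib}) / gib) // 3
}
```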

@dubstack
Author

dubstack commented Jun 6, 2016

@derekwaynecarr Thanks a lot for your comments. Yeah, I too feel that a general cgroup hierarchy outline, similar to the one in the kubelet-systemd proposal, would be a really good idea.

@davidopp
Member

davidopp commented Jul 9, 2016

Just realized I never followed up on this...
@dubstack @dchen1107 @vishh and I discussed this a few weeks ago and decided to make the behavior I was talking about configurable (per cluster) rather than hard-coded, so you can decide whether you value higher utilization or very predictable tail latency for latency-sensitive servers. I think we decided the default would be higher utilization.

@dubstack
Author

dubstack commented Jul 9, 2016

Yes, I will be adding a section highlighting the change, something along the lines of this:

We add a flag, qos-memory-overcommitment, to the kubelet, which allows users to configure the percentage of memory overcommitment on the node. The default is 100, so by default we allow complete overcommitment on the node: we let BE pods use as much memory as they want and do not reserve any resources for the G and Bu pods. As expected, if there is an OOM in such a case, we first kill the BE pods before the G and Bu pods. On the other hand, if a user wants to ensure very predictable tail latency for latency-sensitive servers, they would need to set qos-memory-overcommitment to a really low value (preferably 0). In this case memory resources would be reserved for the G and Bu pods, and BE pods would only be able to use the leftover memory.
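
A minimal sketch of how the flag could feed into the BE memory limit; the exact formula below is one possible reading of the description above rather than something quoted from the patch, and the numbers are made up.

```go
// Illustrative sketch: one plausible mapping from qos-memory-overcommitment
// to the BE cgroup's memory limit. With 100 (the default) nothing is reserved
// for G and Bu pods; with 0 their requests are fully reserved.
package main

import "fmt"

func beMemoryLimit(allocatable, gAndBuRequests, overcommitPercent int64) int64 {
	reserved := gAndBuRequests * (100 - overcommitPercent) / 100
	return allocatable - reserved
}

func main() {
	const gib = int64(1) << 30
	allocatable, requests := 8*gib, 5*gib

	fmt.Println(beMemoryLimit(allocatable, requests, 100) / gib) // 8: full overcommitment
	fmt.Println(beMemoryLimit(allocatable, requests, 0) / gib)   // 3: requests fully reserved
}
```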

cc @derekwaynecarr @vishh @philips @davidopp

@vishh
Contributor

vishh commented Jul 11, 2016

@davidopp PTAL at this patch - ad8e2c2

@derekwaynecarr This patch is ready to go, unless you have any more comments.

@dubstack
Author

@davidopp @derekwaynecarr ping.

@davidopp
Member

Don't wait on me -- based on #26751 (comment) it sounds like you've addressed my concern, and I won't have time to re-read the doc soon.

@dubstack dubstack added this to the v1.4 milestone Jul 24, 2016
@dubstack dubstack added release-note-none and removed release-note-label-needed labels Jul 24, 2016
@vishh vishh self-assigned this Jul 25, 2016
@vishh vishh added the lgtm label Jul 25, 2016
Signed-off-by: Buddha Prakash <buddhap@google.com>
@k8s-github-robot k8s-github-robot removed the lgtm label Jul 25, 2016
@dubstack
Author

Reattaching LGTM after running hack/update-munge-docs.sh to modify the image warning URLs at the top.

@dubstack dubstack added lgtm and removed release-note-none labels Jul 25, 2016
@k8s-github-robot k8s-github-robot added release-note-label-needed and removed lgtm labels Jul 25, 2016
@dubstack dubstack added lgtm, release-note-none and removed release-note-label-needed labels Jul 25, 2016
@k8s-bot

k8s-bot commented Jul 25, 2016

GCE e2e build/test passed for commit e425c6b.

@k8s-github-robot

Automatic merge from submit-queue
