
Vertical pod auto-sizer #10782

Closed
erictune opened this issue Jul 6, 2015 · 34 comments
Labels
area/isolation lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@erictune
Member

erictune commented Jul 6, 2015

We should create a vertical auto-sizer. A vertical auto-sizer sets the compute resource limits and requests for pods which do not have them set, and periodically adjusts them based on demand signals. It does not directly deal with replication controllers, services, or nodes.
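For illustration only, a minimal sketch of the control loop that description implies (every type and name below is hypothetical, not an existing Kubernetes API): observe demand signals, compute new requests, and hand them to some actuator.

```go
// Hypothetical sketch of the vertical auto-sizer loop described above.
// None of these types exist in Kubernetes; they only illustrate the shape of the idea.
package autosizer

import "time"

// Recommendation is a new CPU/memory request for one container.
type Recommendation struct {
	Container   string
	CPUMilli    int64 // millicores
	MemoryBytes int64
}

// Recommender turns observed demand signals into per-container recommendations.
type Recommender interface {
	Recommend(pod string) []Recommendation
}

// Actuator applies recommendations, e.g. via a deployment update or (much later)
// an in-place pod resize.
type Actuator interface {
	Apply(pod string, recs []Recommendation) error
}

// Run periodically adjusts the pods that opted in to vertical auto-sizing.
func Run(pods []string, r Recommender, a Actuator, interval time.Duration) {
	for range time.Tick(interval) {
		for _, pod := range pods {
			if recs := r.Recommend(pod); len(recs) > 0 {
				_ = a.Apply(pod, recs) // a real controller would retry and report errors
			}
		}
	}
}
```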

Related issues:

@erictune erictune added this to the v1.0-post milestone Jul 6, 2015
@erictune erictune added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Jul 6, 2015
@bgrant0607 bgrant0607 added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. area/isolation labels Jul 6, 2015
@davidopp
Member

davidopp commented Jul 7, 2015

@jszczepkowski Do you have any issues open already on this?

@jszczepkowski
Contributor

I don't have an issue open for it yet.

@piosz piosz mentioned this issue Jul 20, 2015
@piosz piosz added the sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. label Jul 20, 2015
@bgrant0607 bgrant0607 removed this from the v1.0-post milestone Jul 24, 2015
@jszczepkowski
Contributor

We are planning to divide vertical pod autoscaling into three implementation steps. We are going to deliver the first of them, Setting Initial Resources, for version 1.1.

Setting Initial Resources

Setting initial resources will be implemented as an admission plugin. It will try to estimate and set memory/CPU requests for containers within each pod if they were not given by the user. (The plugin will not set limits, to avoid OOM killing.)

We will additionally annotate container metrics with the image name. Usage for a given image will be aggregated (it is not yet decided how and by whom), and the initial resources plugin will set requests based on that aggregation.
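A rough sketch of that defaulting step (a sketch only, not the actual admission plugin code; estimateFromImageHistory and its placeholder values stand in for the still-undecided aggregation backend):

```go
// Sketch of the defaulting step described above (not the actual admission plugin).
// For each container without a CPU/memory request, look up an estimate aggregated
// per image and set only the request, never the limit.
package initialresources

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// estimateFromImageHistory is a hypothetical stand-in for the aggregation
// backend ("how and by whom" was still undecided in this comment).
func estimateFromImageHistory(image string) (cpu, mem resource.Quantity) {
	return resource.MustParse("100m"), resource.MustParse("128Mi")
}

// setInitialRequests fills in missing requests on a pod.
func setInitialRequests(pod *v1.Pod) {
	for i := range pod.Spec.Containers {
		c := &pod.Spec.Containers[i]
		if c.Resources.Requests == nil {
			c.Resources.Requests = v1.ResourceList{}
		}
		cpu, mem := estimateFromImageHistory(c.Image)
		if _, ok := c.Resources.Requests[v1.ResourceCPU]; !ok {
			c.Resources.Requests[v1.ResourceCPU] = cpu
		}
		if _, ok := c.Resources.Requests[v1.ResourceMemory]; !ok {
			c.Resources.Requests[v1.ResourceMemory] = mem
		}
	}
}
```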

Reactive vertical autoscaling by deployment update

We will add a new object, a vertical pod autoscaler, which will work at the deployment level. When specifying a pod template in a deployment, the user will have the option to set an enable_vertical_autoscaler flag. If the auto flag is given, the vertical pod autoscaler will monitor the resource usage of the pod's containers and change their resource requirements by updating the pod template in the deployment object, so the deployment will act as the actuator of the autoscaler. Note that a user can both specify requirements for a pod and turn on the auto flag for it. In that case, the requirements given by the user will be treated only as initial values and may be overwritten by the autoscaler.
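A sketch of the deployment-as-actuator idea, written against today's client-go for concreteness; it assumes the Deployment has already opted in via the proposed flag, and the function name and recommended values are hypothetical:

```go
// Sketch of actuating a recommendation through the Deployment, as described above:
// patch the pod template's requests and let the Deployment roll the pods.
package vpa

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func applyRecommendation(ctx context.Context, cs kubernetes.Interface, ns, name, container string, cpu, mem resource.Quantity) error {
	d, err := cs.AppsV1().Deployments(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for i := range d.Spec.Template.Spec.Containers {
		c := &d.Spec.Template.Spec.Containers[i]
		if c.Name != container {
			continue
		}
		if c.Resources.Requests == nil {
			c.Resources.Requests = v1.ResourceList{}
		}
		// Per the design, user-provided values are only initial and may be overwritten.
		c.Resources.Requests[v1.ResourceCPU] = cpu
		c.Resources.Requests[v1.ResourceMemory] = mem
	}
	// Updating the template triggers a rollout, so the Deployment acts as the actuator.
	_, err = cs.AppsV1().Deployments(ns).Update(ctx, d, metav1.UpdateOptions{})
	return err
}
```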

Reactive vertical autoscaling by in-place update

We have an initial idea for a more complicated autoscaler, which will not be bound to the deployment object, but will work at the pod level and actuate resource requirements by in-place update of the pod. Before the update, such an autoscaler will first need to consult the scheduler to check whether the new resources for the pod will fit and whether an in-place update is feasible. The answer given by the scheduler will not be 100% reliable: it may still happen that, after the in-place update, the pod is killed by the kubelet due to lack of resources.
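To make the "consult the scheduler" step concrete, a deliberately naive memory-only fit check is sketched below; a real scheduler check would consider much more, and (as noted above) even a positive answer does not guarantee the pod survives:

```go
// Illustration only of the "consult the scheduler" step: a naive fit check
// against a node's allocatable memory. In-place pod resizing did not exist
// when this was written.
package vpa

// fitsAfterResize reports whether a pod's new memory request still fits within
// the node's allocatable memory, given the sum of the other pods' requests on
// that node (all values in bytes). Even a "yes" here can later be invalidated
// by actual usage, so the kubelet may still kill the pod.
func fitsAfterResize(nodeAllocatableBytes, otherPodsRequestBytes, newRequestBytes int64) bool {
	return otherPodsRequestBytes+newRequestBytes <= nodeAllocatableBytes
}
```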

@jszczepkowski
Contributor

CC @piosz @fgrzadkowski

@jszczepkowski jszczepkowski changed the title Vertical auto-sizer Vertical pod auto-sizer Aug 3, 2015
@smarterclayton
Contributor

@derekwaynecarr @ncdc

@smarterclayton
Contributor

Metrics aggregation needs a top-level issue to track it. I'm not aware of one, but we'd like to see it be usable from several angles: UI, and tracking other container metrics from related systems (load balancers).

@AnanyaKumar
Contributor

CC me

@erictune
Member Author

It is important that there be feedback when the predictions are wrong. In particular, I think it is important that a Pod which is over its request (due to an incorrect initial prediction) is much more likely to be killed than some other pod which is under its request. That way, a malfunction of the "Setting Initial Limits" system appears to affect specific pods rather than random pods; if it hit random pods, it would be very difficult to diagnose.

One way to do that is to make the kill probability in a system OOM situation proportional to the amount over request. @AnanyaKumar @vishh does the current implementation have that property?
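As a toy illustration of the property being asked about (not the kubelet's actual OOM policy), one could rank kill candidates by how far they are over their memory request:

```go
// Toy illustration of the requested property: rank OOM-kill candidates by how
// far they are over their memory request, so a badly mispredicted pod is
// killed first. This is not the kubelet's actual policy.
package vpa

// overage returns how many bytes a container is over its request; containers
// at or under their request score zero and are preferred survivors.
func overage(usageBytes, requestBytes int64) int64 {
	if usageBytes <= requestBytes {
		return 0
	}
	return usageBytes - requestBytes
}
```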

@dchen1107 pointed out that it is bad to have a system OOM in the first place. So, two things we might do are:

  • have the "Setting Initial Limits" system set a Limit which is, say, 2x the initial request. That way, assuming there are several similarly sized pods on a machine, a single misbehaved pod will exceed its limit before causing a system OOM. This is not foolproof, but I think it is a really good heuristic to start with. @jszczepkowski thoughts on this specific suggestion? If we do this in the initial version, we can always back it out later, but adding it later is harder since it breaks people's assumptions.
  • @dchen1107 and others have talked about a system to dynamically set the memory limits on pods which are over their request to such a value that a single pod spiking is unlikely to cause a system OOM. This requires a control loop, but we have experience that suggests user-space control loops can work well. The drawback of this approach is that it requires support for updating limits, which Docker doesn't provide yet. So, I don't think we can do this for v1.1.

TL;DR: can we please set the limit to 2x the predicted request for v1.1?
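A minimal sketch of that heuristic (a suggestion in this thread, not shipped behavior), doubling each predicted request to derive a limit:

```go
// Sketch of the "limit = 2x the predicted request" heuristic suggested above.
package initialresources

import (
	v1 "k8s.io/api/core/v1"
)

// limitFromRequest returns the requests doubled, for use as the container limits.
func limitFromRequest(requests v1.ResourceList) v1.ResourceList {
	limits := v1.ResourceList{}
	for name, q := range requests {
		doubled := q.DeepCopy()
		doubled.Add(q) // 2x the predicted request
		limits[name] = doubled
	}
	return limits
}
```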

@vishh
Contributor

vishh commented Aug 14, 2015

@erictune: Yes. The kernel will prefer to kill containers that exceed their request in the case of an OOM scenario.

+1 for starting with a conservative estimate.

@piosz
Member

piosz commented Aug 17, 2015

@erictune I see your point of view, and I agree that setting Limits would solve your case. On the other hand, I can imagine other situations where it causes problems rather than solving them, especially when the estimation is wrong and the user observes an unexpected kill of their container. So we need to have high confidence when setting Limits, and we can't guarantee that from the beginning.

I think everyone agrees that setting Request should improve the overall experience, which may not be true for setting Limits. Long term we definitely want to set both, but I would set only Request in the first version (which may be a version other than v1.1), gather some feedback from users, and then eventually add setting Limits once we have the algorithm tuned.

@vishh What about having two containers that both exceed their request: which one will be killed? The one that exceeds its request 'more', or a random one?

@vishh
Contributor

vishh commented Aug 17, 2015

As per the current kubelet QoS policy, all processes exceeding their request are equally likely to be killed by the OOM killer.

@piosz
Member

piosz commented Aug 17, 2015

By 'more', do you mean the relative or the absolute value?

@vishh
Contributor

vishh commented Aug 17, 2015

@piosz: I updated my original comment. Does it make sense now?

@fgrzadkowski
Contributor

  • We don't need in-place update for the MVP of the vertical pod autoscaler. We can just be more conservative and recreate pods via deployments.
  • Infrastore would be useful, but for the MVP we can aggregate this data in the VPA controller if we don't have Infrastore by then, or we can read this information from a monitoring pipeline (see the sketch after this list).
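As one concrete way to "read this information from a monitoring pipeline", the sketch below uses the resource metrics API via the k8s.io/metrics client; that API post-dates this comment and is only an assumption about where an MVP could get per-container usage:

```go
// Sketch of reading per-container usage from the resource metrics API
// (one possible "monitoring pipeline" for a VPA MVP).
package vpa

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	metricsclient "k8s.io/metrics/pkg/client/clientset/versioned"
)

func printUsage(ctx context.Context, mc metricsclient.Interface, namespace string) error {
	pods, err := mc.MetricsV1beta1().PodMetricses(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, pm := range pods.Items {
		for _, c := range pm.Containers {
			// Usage is a ResourceList; CPU and memory would feed the recommender.
			fmt.Printf("%s/%s cpu=%s mem=%s\n",
				pm.Name, c.Name, c.Usage.Cpu().String(), c.Usage.Memory().String())
		}
	}
	return nil
}
```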

@fgrzadkowski
Contributor

@krzysztofgrygiel

@derekwaynecarr
Member

If we pursue a vertical autosizer that requires kicking a deployment, how hard is it to take that requirement back? For example, I would think many of our users would prefer a solution that did not require a re-deploy and instead could re-size existing pods.

@erictune
Member Author

How would a vertical autosizer that did restarts work with a Deployment, exactly? Can resizing happen concurrently with a new image rollout? If the user wants to roll back the image change, and there was an intervening resource change, what happens? Can I end up with four ReplicaSets (the cross product of two image versions and old/new resource advice)? Are these competing for revisionHistoryLimit?

@smarterclayton
Contributor

An autosizer blowing out my deployment revision budget is not ok :)


@derekwaynecarr
Member

That said, I think most Java apps would need to restart to take advantage, but ideally that restart would not result in a reschedule.


@davidopp
Member

@derekwaynecarr I don't think anyone is proposing "a vertical autosizer that requires kicking a deployment" in the long run -- instead I think @fgrzadkowski is saying that the first version would work that way because it's simpler and isn't blocked on in-place resource update in the kubelet. Our plan for the next step of PodDisruptionBudget is to allow it to specify a disruption rate, not just a max number simultaneously down. So you could imagine attaching a max-disruption-rate PDB to your Deployment that the vertical autoscaler would respect (i.e. it would not exceed the specified rate when doing resource updates that require killing the container).

I think @erictune is asking a good question. I was surprised that @fgrzadkowski said vertical autoscaling would create a new Deployment. IIRC in Borg we use an API that is distinct from collection update (i.e. does not create a new future collection), to handle vertical autoscaling, so that it doesn't interfere with any user-initiated update that might be ongoing at the same time.

@fgrzadkowski
Contributor

@davidopp I didn't suggest creating a new deployment. I only suggested changing requirements via the existing deployment.

@erictune I think those are great questions! I don't have concrete answers - it should be covered in a proposal/design doc. However, I recall a conversation with @bgrant0607 some time ago that a Deployment could potentially have multiple rollouts in flight along different axes. With regard to limiting how quickly we would roll it out, I agree with @davidopp that it should be solved by PDB.

@derekwaynecarr I imagine that initially, in validation, we would always expect the target object to be a Deployment. Later we can relax this requirement and accept ReplicaSets or Pods. Even in the final solution, the user should be aware that some reschedules/restarts may happen. Maybe we just document that "for now" we will always recreate the pod, and in the future we will change it. Maybe supporting in-place update should be a prerequisite for becoming a GA feature?

@bgrant0607
Member

Some quick comments:

@dhzhuo

dhzhuo commented Mar 16, 2017

@jszczepkowski You mentioned that "the admission plugin will try to estimate and set values for request memory/CPU for containers within each pod if they were not given by user."

Can you please elaborate on how the admission plugin does the estimation? Does it estimate based on historical data of similar jobs, or based on some profiling result, or something else? Thanks.

@jszczepkowski
Contributor

@bitbyteshort
You are referring to an old, obsolete design. For the current design proposal, please see kubernetes/community#338.

@dhzhuo

dhzhuo commented Mar 20, 2017

@jszczepkowski thanks for pointing me to the latest proposal.

@mhausenblas

FYI: I've put together a blog post aiming at raising awareness and introducing our demonstrator, resorcerer.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 2, 2018
@warmchang
Contributor

Can we try this VPA feature in K8s release 1.8?

@DirectXMan12
Contributor

Not really. None of the work has really landed yet (except for @mhausenblas's PoC called resorcerer, but that's not quite the same as the final design).

@bgrant0607
Member

/remove-lifecycle stale
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 22, 2018
@mwielgus
Contributor

VPA is in alpha in https://github.com/kubernetes/autoscaler
