
Vertical pod auto-sizer #10782

Closed
erictune opened this issue Jul 6, 2015 · 34 comments
Labels
area/isolation lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@erictune
Member

erictune commented Jul 6, 2015

We should create a vertical auto-sizer. A vertical auto-sizer sets the compute resource limits and requests for pods which do not have them set, and periodically adjusts them based on demand signals. It does not directly deal with replication controllers, services, or nodes.
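For illustration only, a minimal sketch of the control loop that description implies (every type and name below is hypothetical, not an existing Kubernetes API): observe demand signals, compute new requests, and hand them to some actuator.

```go
// Hypothetical sketch of the vertical auto-sizer loop described above.
// None of these types exist in Kubernetes; they only illustrate the shape of the idea.
package autosizer

import "time"

// Recommendation is a new CPU/memory request for one container.
type Recommendation struct {
	Container   string
	CPUMilli    int64 // millicores
	MemoryBytes int64
}

// Recommender turns observed demand signals into per-container recommendations.
type Recommender interface {
	Recommend(pod string) []Recommendation
}

// Actuator applies recommendations, e.g. via a deployment update or (much later)
// an in-place pod resize.
type Actuator interface {
	Apply(pod string, recs []Recommendation) error
}

// Run periodically adjusts the pods that opted in to vertical auto-sizing.
func Run(pods []string, r Recommender, a Actuator, interval time.Duration) {
	for range time.Tick(interval) {
		for _, pod := range pods {
			if recs := r.Recommend(pod); len(recs) > 0 {
				_ = a.Apply(pod, recs) // a real controller would retry and report errors
			}
		}
	}
}
```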

Related issues:

@erictune erictune added this to the v1.0-post milestone Jul 6, 2015
@erictune erictune added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Jul 6, 2015
@bgrant0607 bgrant0607 added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. area/isolation labels Jul 6, 2015
@davidopp
Member

davidopp commented Jul 7, 2015

@jszczepkowski Do you have any issues open already on this?

@jszczepkowski
Contributor

I don't have an issue open for it yet.

@piosz piosz mentioned this issue Jul 20, 2015
@piosz piosz added the sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. label Jul 20, 2015
@bgrant0607 bgrant0607 removed this from the v1.0-post milestone Jul 24, 2015
@jszczepkowski
Contributor

We are planning to divide vertical pod autoscaling into three implementation steps. We are going to deliver the first of them, Setting Initial Resources, for version 1.1.

Setting Initial Resources

Setting initial resources will be implemented as an admission plugin. It will try to estimate and set memory/CPU requests for containers within each pod if they were not given by the user. (The plugin will not set limits, to avoid OOM killing.)

We will additionally annotate container metrics with the image name. Usage for a given image will be aggregated (it is not yet decided how and by whom), and the initial resources plugin will set requests based on that aggregation.
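A rough sketch of that defaulting step (a sketch only, not the actual admission plugin code; estimateFromImageHistory and its placeholder values stand in for the still-undecided aggregation backend):

```go
// Sketch of the defaulting step described above (not the actual admission plugin).
// For each container without a CPU/memory request, look up an estimate aggregated
// per image and set only the request, never the limit.
package initialresources

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// estimateFromImageHistory is a hypothetical stand-in for the aggregation
// backend ("how and by whom" was still undecided in this comment).
func estimateFromImageHistory(image string) (cpu, mem resource.Quantity) {
	return resource.MustParse("100m"), resource.MustParse("128Mi")
}

// setInitialRequests fills in missing requests on a pod.
func setInitialRequests(pod *v1.Pod) {
	for i := range pod.Spec.Containers {
		c := &pod.Spec.Containers[i]
		if c.Resources.Requests == nil {
			c.Resources.Requests = v1.ResourceList{}
		}
		cpu, mem := estimateFromImageHistory(c.Image)
		if _, ok := c.Resources.Requests[v1.ResourceCPU]; !ok {
			c.Resources.Requests[v1.ResourceCPU] = cpu
		}
		if _, ok := c.Resources.Requests[v1.ResourceMemory]; !ok {
			c.Resources.Requests[v1.ResourceMemory] = mem
		}
	}
}
```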

Reactive vertical autoscaling by deployment update

We will add a new object, a vertical pod autoscaler, which will work at the deployment level. When specifying a pod template in a deployment, the user will have the option to set an enable_vertical_autoscaler flag. If the auto flag is given, the vertical pod autoscaler will monitor the resource usage of the pod's containers and change their resource requirements by updating the pod template in the deployment object, so the deployment will act as the actuator of the autoscaler. Note that a user can both specify requirements for a pod and turn on the auto flag for it. In that case, the requirements given by the user will be treated only as initial values and may be overwritten by the autoscaler.
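A sketch of the deployment-as-actuator idea, written against today's client-go for concreteness; it assumes the Deployment has already opted in via the proposed flag, and the function name and recommended values are hypothetical:

```go
// Sketch of actuating a recommendation through the Deployment, as described above:
// patch the pod template's requests and let the Deployment roll the pods.
package vpa

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func applyRecommendation(ctx context.Context, cs kubernetes.Interface, ns, name, container string, cpu, mem resource.Quantity) error {
	d, err := cs.AppsV1().Deployments(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for i := range d.Spec.Template.Spec.Containers {
		c := &d.Spec.Template.Spec.Containers[i]
		if c.Name != container {
			continue
		}
		if c.Resources.Requests == nil {
			c.Resources.Requests = v1.ResourceList{}
		}
		// Per the design, user-provided values are only initial and may be overwritten.
		c.Resources.Requests[v1.ResourceCPU] = cpu
		c.Resources.Requests[v1.ResourceMemory] = mem
	}
	// Updating the template triggers a rollout, so the Deployment acts as the actuator.
	_, err = cs.AppsV1().Deployments(ns).Update(ctx, d, metav1.UpdateOptions{})
	return err
}
```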

Reactive vertical autoscaling by in-place update

We have an initial idea for a more complicated autoscaler, which will not be bound to the deployment object, but will work at the pod level and actuate resource requirements by in-place update of the pod. Before the update, such an autoscaler will first need to consult the scheduler to check whether the new resources for the pod will fit and whether an in-place update is feasible. The answer given by the scheduler will not be 100% reliable: it may still happen that, after the in-place update, the pod is killed by the kubelet due to lack of resources.
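To make the "consult the scheduler" step concrete, a deliberately naive memory-only fit check is sketched below; a real scheduler check would consider much more, and (as noted above) even a positive answer does not guarantee the pod survives:

```go
// Illustration only of the "consult the scheduler" step: a naive fit check
// against a node's allocatable memory. In-place pod resizing did not exist
// when this was written.
package vpa

// fitsAfterResize reports whether a pod's new memory request still fits within
// the node's allocatable memory, given the sum of the other pods' requests on
// that node (all values in bytes). Even a "yes" here can later be invalidated
// by actual usage, so the kubelet may still kill the pod.
func fitsAfterResize(nodeAllocatableBytes, otherPodsRequestBytes, newRequestBytes int64) bool {
	return otherPodsRequestBytes+newRequestBytes <= nodeAllocatableBytes
}
```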

@jszczepkowski
Contributor

CC @piosz @fgrzadkowski

@jszczepkowski jszczepkowski changed the title Vertical auto-sizer Vertical pod auto-sizer Aug 3, 2015
@smarterclayton
Contributor

@derekwaynecarr @ncdc

@smarterclayton
Contributor

Metrics aggregation needs a top-level issue to track it. I'm not aware of one, but we'd like to see it be usable from several angles: UI, and tracking other container metrics from related systems (load balancers).

@AnanyaKumar
Contributor

CC me

@erictune
Member Author

It is important that there be feedback when the predictions are wrong. In particular, I think it is important that a Pod which is over its request (due to an incorrect initial prediction) is much more likely to be killed than some other pod which is under its request. That way, a malfunction of the "Setting Initial Limits" system appears to affect specific pods rather than random pods; if it hit random pods, it would be very difficult to diagnose.

One way to do that is to make the kill probability in a system OOM situation proportional to the amount over request. @AnanyaKumar @vishh does the current implementation have that property?
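As a toy illustration of the property being asked about (not the kubelet's actual OOM policy), one could rank kill candidates by how far they are over their memory request:

```go
// Toy illustration of the requested property: rank OOM-kill candidates by how
// far they are over their memory request, so a badly mispredicted pod is
// killed first. This is not the kubelet's actual policy.
package vpa

// overage returns how many bytes a container is over its request; containers
// at or under their request score zero and are preferred survivors.
func overage(usageBytes, requestBytes int64) int64 {
	if usageBytes <= requestBytes {
		return 0
	}
	return usageBytes - requestBytes
}
```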

@dchen1107 pointed out that it is bad to have a system OOM in the first place. So, two things we might do are:

  • have the "Setting Initial Limits" system set a Limit which is, say, 2x the initial request. That way, assuming there are several similarly sized pods on a machine, a single misbehaved pod will exceed its limit before causing a system OOM. This is not foolproof, but I think it is a really good heuristic to start with. @jszczepkowski thoughts on this specific suggestion? If we do this in the initial version, we can always back it out later, but adding it later is harder since it breaks people's assumptions.
  • @dchen1107 and others have talked about a system to dynamically set the memory limits on pods which are over their request to such a value that a single pod spiking is unlikely to cause a system OOM. This requires a control loop, but we have experience that suggests user-space control loops can work well. The drawback of this approach is that it requires support for updating limits, which Docker doesn't provide yet. So, I don't think we can do this for v1.1.

TL;DR: can we please set the limit to 2x the predicted request for v1.1?
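A minimal sketch of that heuristic (a suggestion in this thread, not shipped behavior), doubling each predicted request to derive a limit:

```go
// Sketch of the "limit = 2x the predicted request" heuristic suggested above.
package initialresources

import (
	v1 "k8s.io/api/core/v1"
)

// limitFromRequest returns the requests doubled, for use as the container limits.
func limitFromRequest(requests v1.ResourceList) v1.ResourceList {
	limits := v1.ResourceList{}
	for name, q := range requests {
		doubled := q.DeepCopy()
		doubled.Add(q) // 2x the predicted request
		limits[name] = doubled
	}
	return limits
}
```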

@vishh
Contributor

vishh commented Aug 14, 2015

@erictune: Yes. The kernel will prefer to kill containers that exceed their request in the case of an OOM scenario.

+1 for starting with a conservative estimate.

@piosz
Member

piosz commented Aug 17, 2015

@erictune I see your point of view, and I agree that setting Limits would solve your case. On the other hand, I can imagine other situations where it causes problems rather than solving them, especially when the estimation is wrong and the user observes an unexpected kill of their container. So we need to have high confidence when setting Limits, and we can't guarantee that from the beginning.

I think everyone agrees that setting Request should improve the overall experience, which may not be true for setting Limits. Long term we definitely want to set both, but I would set only Request in the first version (which may be a version other than v1.1), gather some feedback from users, and then eventually add setting Limits once we have the algorithm tuned.

@vishh What about having two containers that both exceed their request: which one will be killed? The one that exceeds its request 'more', or a random one?

@vishh
Contributor

vishh commented Aug 17, 2015

As per the current kubelet QoS policy, all processes exceeding their request are equally likely to be killed by the OOM killer.

@piosz
Member

piosz commented Aug 17, 2015

By 'more', do you mean the relative or the absolute value?

@vishh
Contributor

vishh commented Aug 17, 2015

@piosz: I updated my original comment. Does it make sense now?

@fgrzadkowski
Contributor

  • We don't need in-place update for the MVP of the vertical pod autoscaler. We can just be more conservative and recreate pods via deployments.
  • Infrastore would be useful, but for the MVP we can aggregate this data in the VPA controller if we don't have Infrastore by then, or we can read this information from a monitoring pipeline (see the sketch after this list).
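As one concrete way to "read this information from a monitoring pipeline", the sketch below uses the resource metrics API via the k8s.io/metrics client; that API post-dates this comment and is only an assumption about where an MVP could get per-container usage:

```go
// Sketch of reading per-container usage from the resource metrics API
// (one possible "monitoring pipeline" for a VPA MVP).
package vpa

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	metricsclient "k8s.io/metrics/pkg/client/clientset/versioned"
)

func printUsage(ctx context.Context, mc metricsclient.Interface, namespace string) error {
	pods, err := mc.MetricsV1beta1().PodMetricses(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, pm := range pods.Items {
		for _, c := range pm.Containers {
			// Usage is a ResourceList; CPU and memory would feed the recommender.
			fmt.Printf("%s/%s cpu=%s mem=%s\n",
				pm.Name, c.Name, c.Usage.Cpu().String(), c.Usage.Memory().String())
		}
	}
	return nil
}
```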

@fgrzadkowski
Contributor

@krzysztofgrygiel

@derekwaynecarr
Member

If we pursue a vertical autosizer that requires kicking a deployment, how hard is it to take that requirement back? For example, I would think many of our users would prefer a solution that did not require a re-deploy and instead could re-size existing pods.

@erictune
Member Author

How would a vertical autosizer that did restarts work with a Deployment, exactly? Can resizing happen concurrently with a new image rollout? If the user wants to roll back the image change, and there was an intervening resource change, what happens? Can I end up with four ReplicaSets (the cross product of two image versions and old/new resource advice)? Are these competing for revisionHistoryLimit?

@smarterclayton
Contributor

An autosizer blowing out my deployment revision budget is not ok :)


@derekwaynecarr
Member

That said, I think most Java apps would need to restart to take advantage, but ideally that restart would not result in a reschedule.


@davidopp
Member

@derekwaynecarr I don't think anyone is proposing "a vertical autosizer that requires kicking a deployment" in the long run -- instead I think @fgrzadkowski is saying that the first version would work that way because it's simpler and isn't blocked on in-place resource update in the kubelet. Our plan for the next step of PodDisruptionBudget is to allow it to specify a disruption rate, not just a max number simultaneously down. So you could imagine attaching a max-disruption-rate PDB to your Deployment that the vertical autoscaler would respect (i.e. it would not exceed the specified rate when doing resource updates that require killing the container).

I think @erictune is asking a good question. I was surprised that @fgrzadkowski said vertical autoscaling would create a new Deployment. IIRC in Borg we use an API that is distinct from collection update (i.e. does not create a new future collection), to handle vertical autoscaling, so that it doesn't interfere with any user-initiated update that might be ongoing at the same time.

@fgrzadkowski
Contributor

@davidopp I didn't suggest creating a new deployment. I only suggested changing requirements via the existing deployment.

@erictune I think those are great questions! I don't have concrete answers - it should be covered in a proposal/design doc. However, I recall a conversation with @bgrant0607 some time ago that a Deployment could potentially have multiple rollouts in flight along different axes. With regard to limiting how quickly we would roll it out, I agree with @davidopp that it should be solved by PDB.

@derekwaynecarr I imagine that initially, in validation, we would always expect the target object to be a Deployment. Later we can relax this requirement and accept ReplicaSets or Pods. Even in the final solution, the user should be aware that some reschedules/restarts may happen. Maybe we just document that "for now" we will always recreate the pod, and in the future we will change it. Maybe supporting in-place update should be a prerequisite for becoming a GA feature?

@bgrant0607
Member

Some quick comments:

@dhzhuo

dhzhuo commented Mar 16, 2017

@jszczepkowski You mentioned that "the admission plugin will try to estimate and set values for request memory/CPU for containers within each pod if they were not given by user."

Can you please elaborate on how the admission plugin does the estimation? Does it estimate based on historical data of similar jobs, or based on some profiling result, or something else? Thanks.

@jszczepkowski
Contributor

@bitbyteshort
You are referring to an old, obsolete design. For the current design proposal, please see kubernetes/community#338.

@dhzhuo

dhzhuo commented Mar 20, 2017

@jszczepkowski thanks for pointing me to the latest proposal.

@mhausenblas

FYI: I've put together a blog post aiming at raising awareness and introducing our demonstrator, resorcerer.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 2, 2018
@warmchang
Contributor

Can we try this VPA feature in K8s release 1.8?

@DirectXMan12
Contributor

Not really. None of the work has really landed yet (except for @mhausenblas's PoC called resorcerer, but that's not quite the same as the final design).

@bgrant0607
Member

/remove-lifecycle stale
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 22, 2018
@mwielgus
Contributor

VPA is in alpha in https://github.com/kubernetes/autoscaler
