
Convert ReplicationController to a plugin (was ReplicationController redesign) #3058

Closed
bgrant0607 opened this issue Dec 19, 2014 · 35 comments
Labels
area/api · area/extensibility · area/usability · kind/design · priority/awaiting-more-evidence · sig/api-machinery

Comments

@bgrant0607
Member

Continued from #1518 and #3024, and also relates to #1624.

Since there will necessarily be a higher barrier to entry for new objects, we should make replication controller composable (e.g., with auto-scalers and deployment managers), pluggable, hookable (#2804), to delegate decisions like which pod to kill to decrease the replica count, when pods should be killed and replaced (esp. moving to new nodes), when the whole collection should terminate (for RestartPolicyOnFailure), etc.

For jobs that terminate (RestartPolicyOnFailure), I'd make the count auto-decrement on success. I'm tempted to say we should support terminating jobs of the per-node variety, but then we'd have to keep track of which nodes the pods had executed on, which seems ugly.

It should support graceful termination (#1535), but I've removed that from this sketch, since it's not specific to replication controller.

Stab at a definition, also using my latest proposed name change:

type OverseerSpec struct {
    ReplicationPolicy *ReplicationPolicy `json:"replicationPolicy,omitempty"`
    Selector map[string]string `json:"selector,omitempty"`
    TemplateRef *ObjectReference `json:"templateRef,omitempty"`
    Template *PodTemplateSpec `json:"template,omitempty"`
}
type ReplicationPolicy struct {
    Count *ReplicationPolicyCount    `json:"count,omitempty"`
    PerNode *ReplicationPolicyPerNode `json:"perNode,omitempty"`
}
type ReplicationPolicyCount struct {
    Count int `json:"count"`
}
type ReplicationPolicyPerNode struct {
}
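
To make the two flavors concrete, here is a minimal illustration of how the sketch above might be instantiated. The values are hypothetical; it assumes the types defined above plus the existing PodTemplateSpec type:

// Hypothetical examples only; values are illustrative.

// Count-based replication: maintain 3 replicas of pods matching the selector.
var countBased = OverseerSpec{
    ReplicationPolicy: &ReplicationPolicy{
        Count: &ReplicationPolicyCount{Count: 3},
    },
    Selector: map[string]string{"app": "frontend"},
    Template: &PodTemplateSpec{ /* pod spec elided */ },
}

// Per-node replication: one pod per eligible node, so no count is needed.
var perNode = OverseerSpec{
    ReplicationPolicy: &ReplicationPolicy{
        PerNode: &ReplicationPolicyPerNode{},
    },
    Selector: map[string]string{"app": "node-agent"},
    Template: &PodTemplateSpec{ /* pod spec elided */ },
}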

Forgiveness example was moved to #1574.

Additional example policies have been removed.

/cc @smarterclayton @thockin @brendandburns @erictune @lavalamp

@bgrant0607 added the kind/design, priority/backlog, area/api, and area/usability labels on Dec 19, 2014
@jbeda
Contributor

jbeda commented Dec 19, 2014

I want to push back here a little. This is a pretty big change in our model. I can see the reasons to do this, but I think we are making our system look very complicated.

It is perhaps unstated, but my understanding of a core principle for Kubernetes is "small, easy-to-understand, composable objects that can be combined for higher-order behavior". The number of options here busts the "small and easy to understand" part of this.

One other thing -- we always imagined that the ReplicationController was one of a class/set/type of object in our system. We want to encourage users to think about other "active" controller objects that can take on management policy for them.

What if we have a set of objects but have them share implementation in the back end? There is a bunch of overlap here, but I think we can steer users toward a set of concepts that are easier to understand. And encourage them to think about and build other controllers that can play in this world.

@jbeda
Contributor

jbeda commented Dec 19, 2014

As for the feature set implied here, I'd expect that "PerNode" would include a node selector as to which nodes to run this pod on.

@davidopp
Member

I think we should separate the questions of whether these policies should be rolled into a single object vs. whether the policies themselves are overly general/configurable.

I like the idea of putting these policies into a single object; decomposing this stuff into separate objects may be difficult and/or confusing.

However, I think we should start simple with the policies. For example, the forgiveness stuff can be drastically simplified. For now we could just have a single bool called "persistent" that under the covers maps to one particular setting of the forgiveness parameters in Brian's proposal, but those parameters wouldn't be exposed to users.

I wasn't sure what the two different fields of CancellationPolicy meant, but presumably we could pick just one of those to expose to users, as a single field like "CancellationGracePeriodSeconds", and either not implement the other or always set it equal to CancellationGracePeriodSeconds.

In this way, basically everything from ForgivenessPolicy down in Brian's proposal would be collapsed to just two items, "persistent" and "CancellationGracePeriodSeconds".


@bgrant0607
Member Author

@jbeda Much of this functionality we don't need right now. I wrote this up to force a conscious decision on our direction. At present, ReplicationController is the only simple object. Service and Pod are complex and getting more complex over time. I think questions are:

  1. How to make simple cases simple and not overwhelm users with complexity? Config generation is part of the answer, but not the whole answer.
  2. How can we represent the functionality we need in the simplest possible way, but not be overly simplistic?
  3. What parts need to be separate but composable?
  4. How do we deal with polymorphism -- objects within a class? Code sharing is only a part of the problem.

The per-node case doesn't need a selector. Pods already have one. There's no point in creating pods that won't pass the selector fit predicate. That should be easy for the per-node controller to figure out.

@davidopp CancellationGracePeriodSeconds sounds like what we call "preemption notice" internally. That should be a property of the pod, not the controller. Forgiveness is on the controller, because it's responsible for performing the replacement. I could see a case for putting it on the pod, as well, though, even though Kubelet would never act upon it. The node controller could act upon the node-related events.

We could put some of these in separate objects. For example, a separate entity could independently enforce deadlines. However, functionality that would modify the core behavior of the replication controller could not be separated out. Considerations for inclusion include similarity to other responsibilities, whether inclusion would obstruct composability, and the complexity of more moving parts. Deadline enforcement, for instance, wouldn't preclude an external agent from shutting down the controller, so it doesn't obstruct composability, and it's simple enough and similar enough to other termination decisions that it should be straightforward to include, assuming we don't split control of terminating workloads into another object.

I'm fine with starting with persistent rather than full-blown forgiveness, though no users have requested it yet. I'd still put it into ForgivenessPolicy.

The cancellation policies assumed a delta between creation and start -- deferred scheduling. I wouldn't imagine we'd include this anytime soon.

In the immediate future, I'd imagine only including stop and replicationPolicy (in addition to the selector and template and ref) -- almost no additional complexity added to the current object.

What I explicitly want to keep out of the controller:

  1. properties that belong in pods
  2. auto-scaling policy (e.g., WIP: auto-scaler proposal #2863)
  3. scheduling policy (e.g., spreading, colocation, pod-specific constraints)
  4. complex restart policies (e.g., max. number of failures across multiple pods, killing all pods after too many fail)
  5. inter-controller start dependencies

(1) is self-explanatory. (2), (3), and (4) need to be able to span multiple controllers. Users will want an unbounded variety of policies for (2), (4), and (5). (4) also potentially requires a greater degree of consistency than we'll eventually want to promise.

@bgrant0607
Member Author

Some alternatives to manage complexity without separate objects:

  • Configuration generation
  • Smart defaults
  • Put "advanced options" into an optional subobject
  • Focus documentation, help, etc. on most commonly used fields

@bgrant0607
Member Author

I forgot one of my most important principles: neither pods nor replication controllers should be considered permanent, durable entities (i.e., pets).

Pods should not have identities that survive rescheduling to new nodes, for the following reasons:

  1. In reality, a pod on a new node is a new pod, controlled by a different Kubelet.
  2. Currently, the new pod will have a new IP address and new storage volumes.
  3. Even once they are migratable, the Kubelets and the component performing the migration will need to be aware of both the source and destination pods.
  4. Upon a planned termination, such as due to disruptive node maintenance, rolling update, resource change, or resource reallocation, we'll want to start the replacement pod prior to terminating the old one, to pull packages, etc., in order to minimize downtime, or to linger in order to drain persistent data off the old node after the fact (called "zombies"). Additionally, in the case of a network partition or failed Kubelet, replacement pods would need to be created while the originals still exist (called "phantoms" or "rogues").

Replication controllers should not convey identities to pods upon creation, either:

  1. Identity actually is closely related to clients' perceptions. Clients talk to services. By design, services can span multiple replication controllers.
  2. Rolling updates are simpler and more flexible with the model of creating a new replication controller for the updated deployment. Otherwise, in-place updates are required, which require vetting by the scheduler prior to being executed and careful logic in Kubelet, and which preclude no-downtime solutions that create replacements prior to termination. Rollbacks would also be more difficult. I also don't want to have to keep track of "shadow pod templates" behind the scenes -- rolling update with just one template is not predictable.
  3. Replication controllers are designed to just tend to pods like shepherds tend to sheep. Sheep exist without shepherds. So should pods. It should be possible to create pods without a replication controller, then create a replication controller after the fact to watch over them.

Work/role assignment should really be dynamic, using master election, fine-grain locks, shard assignment, pubsub, task queues, load balancing, etc. Nominal services (#260) are a convenience to address relatively static cases.

@bgrant0607
Member Author

I also want to minimize mutation of pods vs. the template. That should be relegated to higher-level objects, client-side configuration generation, setting of defaults when creating the pod template, and other controllers, such as auto-sizers (e.g., setting pod/container cpu and memory requirements/limits).

@davidopp
Member

I think I must have misunderstood CancellationPolicy. What does it mean?

Why is forgiveness in the overseer spec rather than the pod template? Is there a general rule to know when some property should be a property of the overseer, and when it should be part of the pod definition?

@bgrant0607
Member Author

Don't worry about CancellationPolicy. It's just an example of a policy we might add.

If Kubelet, the node controller, or the scheduler will consume it, then it needs to go in the pod. If only the overseer needs it, then it doesn't belong in the pod.

@bgrant0607
Member Author

Also, in general, properties about sets of pods don't belong in the pod/podtemplate.

@davidopp
Member

To play the devil's advocate for a moment here, why should any configuration information go in the overseer (outside of the pod template)? It seems there are many benefits to putting configuration information in pods rather than in the overseer. For example,

  1. A single pod can be controlled by multiple overseers, so you reduce confusion by putting the configuration in the pod
  2. From a debugging/UX standpoint it may be simpler if all the configuration information that affects a pod's behavior is in the pod definition rather than spread across multiple components
  3. Though you might want a group of pods to obey the same behavior, you can just use the template in the overseer to generate them; you don't need to make it a configuration property of the overseer. More generally, since the overseer contains a pod template, any configuration you might put at the overseer level can always be pushed into the pod level by just moving it into the pod template section of the overseer.
  4. Somewhat speculative, but if we want to move functionality between overseer and kubelet later, less work is needed if all the information is already in the pod.

I agree it would be nice if the overseer didn't have to read any pod information to do its work, but this seems a somewhat weak reason if it's the only reason. Moreover, IIUC the overseer needs to list pods anyway to know which pods it is managing.

So it seems to me that the only information you really need to put in the overseer is exactly the information it has today -- number of replicas, label query, and pod template.

I guess the one counter-argument would be if you want to hand off control of a pod from one overseer to another and with the handoff automatically change some behavior. Making that behavior be a configuration property of the controller would avoid having to update the pod. But are there any examples of this use case? And wouldn't you need to update the pod's labels to hand off the pod anyway?

@bgrant0607
Member Author

@davidopp Pods can be started without any controller. It doesn't make sense to specify behavior in the pod that can't be implemented without the assistance of another entity.

@jbeda
Contributor

jbeda commented Dec 22, 2014

I want to strongly object to the term "overseer". This is a pretty generic term that could apply to any number of objects in Kubernetes. We might as well call it the "manager".

@asim

asim commented Dec 22, 2014

I just want to echo jbeda's comments here: "small, easy-to-understand, composable objects that can be combined for higher-order behavior". For me this is really the advantage of Kubernetes over many heavyweight orchestration systems. We should not overload Kubernetes itself but rather compose an ecosystem of tools around it. Kubernetes, as a low-level cluster-management building block, should do a few things really well and then allow others to build systems around it. It's very early days, and the key to a sustainable, long-lived project that gains mass adoption is to provide an easy-to-use, powerful v1.0 that doesn't try to be everything for everyone.

@thockin
Member

thockin commented Dec 22, 2014

I'm just catching up, but my gut response echoes @asim and @jbeda - simpler is better, and this is FAR from simple.

I'd rather see commonality by convention rather than by complex top-level structure, even if the cost of that is some duplicated code.

@davidopp
Member

Can we try to sketch out the $whatever_we're_renaming_replication_controller_to use cases to try to figure this out? My recent experience was with a system where having too many components, and designing the mechanisms for them to interact, was a major source of pain, so I'm going to be biased towards the maximally monolithic approach in pretty much any situation. But I think we can approach this objectively by thinking about use cases. For example, having a separate controller for run-everywhere pods vs. non-run-everywhere pods seems like overkill to me, but probably the only downside is code duplication. But once we start talking about having a separate replication controller for each of the properties in Brian's sample code (which is what you guys are suggesting, or am I misunderstanding?), and having multiple controllers simultaneously manage multiple pods, I think things get confusing quickly. But sketching out more scenarios of how we see controllers being used might help.

@bgrant0607
Member Author

I think people are mostly reacting to example policies, so I've removed them. I also removed forgiveness, which should go in Pod, and the suspend/stop bits, which are not specific to replication controller.

What remains still stands out as by far the simplest object in Kubernetes. I agree that we want composable building blocks, and I thought fairly carefully about what functionality I proposed including in replication controller vs. what functionality was important to keep separate. The rationale was documented above.

As for what we call it, let's bikeshed on that in #3024.

@thockin
Member

thockin commented Dec 23, 2014

You've now reduced it to the same structure as volumes. I've always been on the fence about this aspect, finding value in both approaches. My main concern is that we had everyone in the room and the general consensus was to use different objects, and now this proposal is the opposite of that.

I'm afraid that if we try to be everything to every case, we will serve none of them well.

What does Spec.Selector represent?

What about the proposed "job controller" - is that another policy or are we going to overload logic based on template's restart policy?

What about the proposals around durable/replicated data (I'm still catching up on email)?

Will policy be arbitrarily extensible (like volume plugins should be)? How will that actually work for a cluster admin to install extensions?

@davidopp
Member

It feels to me like we understand two categories of use cases of replication controllers well: using them for heterogeneous deployments (rolling upgrade, multiple release tracks, etc.), and using them for different node selections (e.g. run-on-every-node vs. normal services). But there's another axis of customization that people keep alluding to but I don't yet understand, namely having multiple replication controllers (or different kinds of controllers, maybe one replication controller plus other types of controllers) manage different behaviors of the same set of pods. How does this work? What are some examples? I'm not saying I think this is a bad idea, but I think that being more concrete about this will help us make decisions about how to decompose the configuration of these behaviors.

@jasonkuhrt
Contributor

For example, having a separate controller for run-everywhere pods vs. non-run-everywhere pods seems like overkill to me

@davidopp I think different "controllers" for different semantic purposes is not a bad thing, even if it seems like two or four or more things could be rolled into one with a few config points. As my point of view is mainly that of a software engineer doing functional programming, I imagine that what I want is a large set of simple primitives with absolutely minimal config (and, where possible, consistent config fields across these primitives). There are lots of other analogies: lego bricks, etc. I honestly feel that anything other than trivial config files is incredibly confusing and frustrating, hiding what should be programming logic. I would be very happy to have a periodic table of Kubernetes. I would be very happy to have long flat manifests as opposed to short deeply nested ones!

@bgrant0607
Member Author

Thoughts on the separate-object approach:

  • Polymorphism/reflection/meta-programming: For everything that would need to deal with similar flavors of controllers (UIs, CLIs, etc.), we'd need an API that would provide a list of Kinds that were controllers and/or, equivalently, Kind metadata that would identify which Kinds were controllers. That would be much more robust than just matching on "Controller" in the Kind. We also need a resize verb (Proposal: scaling interface #1629), since at least 2 types of controllers would need to be resizable (service and batch job controllers), but we need that regardless. Note that some types of controllers (per-node, cron) wouldn't support resize, but I think that would work about the same with the subobject approach, as well.
  • Code reuse: We'd want to eliminate the vast majority of code duplication: generic REST objects, controller library, splitting out the pod template (Separate the pod template from replicationController #170), built-in readiness checks (Support health (readiness) checks #620), ... We want almost all of those things, anyway.
  • API surface: This is a trade-off: more first-class objects vs. more complex objects. As discussed above, even with the subobject approach, controllers will stand out as the simplest objects in the system. Therefore, I think more top-level objects would actually make the system look more complex rather than simpler, especially if we don't implement most of the features I used as examples. I also think that using separate objects is more likely to drive more complexity into pods, as @davidopp's comments demonstrate, in order to avoid duplicating that logic in multiple controllers.
  • Extensibility: This is where the separate-object approach shines. Really, only pods and services need to be core objects in the system, since they directly affect underlying infrastructure resources. All controllers could/should be plugins, which would make it easier to add new controllers, such as a cron controller.

I think the best motivating example for the separate-object approach is the job controller (#1624). That's likely to need a number of features specific to bounded-duration/deferred-execution batch/workflow types of jobs: queuing and execution deadlines, success/failure aggregation, gang scheduling and/or admission control, max-in-flight limits (e.g., run 50 pods, no more than 10 at a time), inter-job dependencies (A before B), pre-start auto-scaling/auto-sizing, intelligent response to out-of-resource events, ...

Potentially nominal services could be considered "controllers" also (controlling services). So could auto-scalers (controlling replication controllers). The flavor of controller could maybe be inferred by the Kind of object controlled, or perhaps that needs to be part of the reflection API, also.

I think the separate-object approach could work, but we need to design the reflection API that would enable meta-programming for multiple kinds of controllers with similar features.

Other thoughts on the per-node controller:

  • I'd like to avoid requiring the "per-node" controller, however that functionality is implemented, to essentially be a scheduler. As mentioned several times in other places, there are ways we can avoid that by providing features that we'll want for other reasons.
  • I'm not convinced that "per-node" is the right concept. We have internal examples of monitoring and control components that need to run per rack, per power unit, per storage array, etc.

@thockin
Member

thockin commented Jan 7, 2015

To also capture what Brian and I discussed a bit yesterday:

The first question to ask is: Do we really think people will want to make controllers outside of Kubernetes code? If the answer is no, then a monolithic compound object seems net simpler. If the answer is yes, then I think separate objects are simpler. I'm going to assume the answer is yes, because I want it to be :) Here's how I mentally run through the differences in how the models should work.

Separate API objects:

  • We need to implement truly self-contained REST plugins
  • We need to implement remote REST plugins (we receive the API calls, but pass operations on to another server)
  • We need to implement dynamic registration of remote REST plugins
  • We implement ReplicationController as a remote REST plugin
  • We need to factor a bunch of logic out as libraries for other controllers to use
  • API for ReplicationController stays ~same

Monolithic API object:

  • We need to refactor ReplicationController along the lines of the above proposal (which is like kubelet volumes)
  • We need to implement self-contained controller plugins
  • We need to implement dynamic registration of controller plugins
  • Maybe we implement ReplicationController as a plugin
  • We need to factor a bunch of logic out as libraries for other controllers to use
  • API for ReplicationController changes

In both cases we need an extension point for plugins. In both cases we need API to be more dynamic. In both cases we need to do a bunch of library abstraction so 3rd party plugins can re-use logic (if they want). If we want a discovery system ("dear apiserver, please tell me all Kinds that act as pod controllers"), we need it in both cases. If we want to have generic code that can operate on things without knowing what they are, the problem is isomorphic for both models.

But the separate objects case gives us generic REST plugins, which I think we want in both models (net less work, net less concepts). It also retains the existing ReplicationController API.

So I fall on the side of separate objects, though only slightly.

Now we can argue about services - are services the same? Much of the same argument applies.
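
To make "generic REST plugins" slightly more concrete, here is one purely hypothetical shape such an extension point could take. None of these names or signatures exist in the codebase; this only sketches the idea that controller Kinds could be registered dynamically, in-process or as a proxy to a remote server:

// Hypothetical sketch; names and signatures are invented for illustration.

// RESTPlugin is the contract a plugin would implement for one Kind.
type RESTPlugin interface {
    Kind() string                                // e.g. "ReplicationController"
    Create(obj interface{}) (interface{}, error) // persist a new object
    Get(name string) (interface{}, error)
    List(labels map[string]string) ([]interface{}, error)
    Delete(name string) error
}

// PluginRegistry would let the apiserver route /<kind>/... requests to a
// registered plugin, whether it runs in-process or forwards to a remote server.
type PluginRegistry struct {
    plugins map[string]RESTPlugin
}

func (r *PluginRegistry) Register(p RESTPlugin) {
    if r.plugins == nil {
        r.plugins = map[string]RESTPlugin{}
    }
    r.plugins[p.Kind()] = p
}

// Kinds lists all registered plugin Kinds. A real discovery API would also
// need metadata saying which of these act as pod controllers, as discussed above.
func (r *PluginRegistry) Kinds() []string {
    kinds := []string{}
    for k := range r.plugins {
        kinds = append(kinds, k)
    }
    return kinds
}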

@bgrant0607 added the priority/important-soon label on Jan 10, 2015
@davidopp
Member

Ah OK thanks, some parts of the thread were about separating different bits of functionality for the same set of pods into separate controllers, so I didn't realize the most recent comments were just about different controllers for different pod types. So I withdraw my objection, which was applicable to this situation. :-)

@thockin
Member

thockin commented Jan 10, 2015

For the record, we should do this with generic REST plugins, in or out of process. I'll queue this up for people to work on.

@bgrant0607
Member Author

To finalize the proposals for the controllers planned in the immediate future:

type ReplicationControllerSpec struct {
    Size int `json:"size"`
    Selector map[string]string `json:"selector,omitempty"`
    TemplateRef *ObjectReference `json:"templateRef,omitempty"`
    Template *PodTemplateSpec `json:"template,omitempty"`
}
type DaemonControllerSpec struct {
    Selector map[string]string `json:"selector,omitempty"`
    TemplateRef *ObjectReference `json:"templateRef,omitempty"`
    Template *PodTemplateSpec `json:"template,omitempty"`
}
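
For illustration only, a hypothetical instantiation of each spec (values invented). The daemon variant simply has no Size field, since the number of pods is driven by the set of eligible nodes rather than by a replica count, which is also why resize wouldn't apply to it:

// Hypothetical example values; assumes the spec types above.

var webServers = ReplicationControllerSpec{
    Size:     3,
    Selector: map[string]string{"app": "web"},
    Template: &PodTemplateSpec{ /* pod spec elided */ },
}

var logAgents = DaemonControllerSpec{
    // No Size field: one pod per eligible node, so this spec is not resizable.
    Selector: map[string]string{"app": "log-agent"},
    Template: &PodTemplateSpec{ /* pod spec elided */ },
}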

@bgrant0607 added the priority/awaiting-more-evidence label and removed the priority/important-soon label on Jan 29, 2015
@bgrant0607
Member Author

This is decided enough for 1.0, so demoting implementation to P3.

@tmrts
Contributor

tmrts commented Mar 15, 2015

Currently the implementation is on hold for 1.0, correct? Is there any work in progress on this feature at the moment?

In the spirit of Kubernetes being modular, multiple small replicationControllers seem to be the most favored solution. For coordination, I think ZeroMQ or nsq would be as good as, or even better than, etcd. I'd be very interested to work on this.

@bgrant0607
Member Author

The decision was to create new controllers for new use cases (e.g., per-node/daemon, RestartOnFailure pods) rather than make ReplicationController more complicated.

To facilitate that, we need to:

  1. Finish the API plugin mechanism API plugin design thread #991
  2. Convert ReplicationController to use the API plugin mechanism
  3. Split out the pod template Separate the pod template from replicationController #170 so that other controllers can use it.
  4. Create a controller framework to make it easier to write controllers.

There is work underway on (3), #5012, and (4), #5270.

AFAIK, there's no direct work on (1) or (2), but we are converting existing objects, including ReplicationController, to use the generic registry implementation and have created a pattern for posting status back from controllers as part of #2726.

If the amount of code required for a new controller were sufficiently reduced by other ongoing efforts, we needn't block implementation of new controllers on the generic plugin mechanism, but the plugin mechanism would be of independent utility.
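
As a rough illustration of what item (4), a controller framework, could factor out, here is a minimal reconcile sketch. The interfaces and names are invented for this example and are not the design being pursued in #5270:

// Hypothetical sketch; the interfaces are invented for illustration only.

// PodLister and PodCreator abstract the API calls a controller needs.
type PodLister interface {
    ListPods(selector map[string]string) ([]string, error)
}

type PodCreator interface {
    CreatePod(template *PodTemplateSpec) error
}

// ReplicaReconciler drives the observed pod count toward a desired count.
type ReplicaReconciler struct {
    Desired  int
    Selector map[string]string
    Template *PodTemplateSpec
    Lister   PodLister
    Creator  PodCreator
}

// ReconcileOnce creates any missing pods; scaling down and retry policy are
// elided. A real framework would invoke this from watch events and a periodic
// resync rather than relying on manual calls.
func (r *ReplicaReconciler) ReconcileOnce() error {
    pods, err := r.Lister.ListPods(r.Selector)
    if err != nil {
        return err
    }
    for i := len(pods); i < r.Desired; i++ {
        if err := r.Creator.CreatePod(r.Template); err != nil {
            return err
        }
    }
    return nil
}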

@davidopp
Member

I'm not sure we should just say every conceivable variation merits a new replication controller. Per-node/daemon might be different enough from a regular replication controller to deserve being separate, but things that can be expressed trivially with one bool (like restart-on-failure vs. don't) should just be configuration parameters for the RC we already have IMO.

@bgrant0607
Member Author

Obsolete
