
Daemon (was Feature: run-on-every-node scheduling/replication (aka per-node controller or daemon controller)) #1518

Closed
jbeda opened this issue Sep 30, 2014 · 56 comments
Labels
area/api Indicates an issue on api area. area/nodecontroller priority/backlog Higher priority than priority/awaiting-more-evidence. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@jbeda
Contributor

jbeda commented Sep 30, 2014

There are cases where we want to run a pod on every node. This'll be useful for monitoring things (cAdvisor, DataDog) or replicated storage agents (HDFS node).

Right now you can approximate this by (a) using a hostPort and (b) setting replication count > nodes. It would be better if we had an explicit way of doing this.

@jbeda jbeda added enhancement sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Sep 30, 2014
@thockin
Member

thockin commented Sep 30, 2014

Do we really want this to be a scheduling feature? Config files can achieve this. We avoided adding this feature internally because it adds complexity that we did not feel was required.

It's a cute idea, but it's sort of a layering violation.


@bketelsen
Contributor

I'm +1 on this. Some things like statsd collectors need to run one per node, and are dependencies of my pods. Keeping all the runtime dependencies together means I have fewer orchestration tools to worry about. And Kubernetes makes sure it stays running.

@thockin
Member

thockin commented Sep 30, 2014

As opposed to dropping a pod config on each machine and letting kubelet run it? Is there a reason the simpler solution is not adequate?


@bketelsen
Contributor

Forgive my ignorance, but how does one "drop a pod config on each machine"? My only interaction with k8s has been through kubecfg so far.

@thockin
Member

thockin commented Sep 30, 2014

The assumption is that if you want to run something on each machine, you're essentially a cluster admin, and can arrange for a config file to appear on each node, rather than scheduling.

/etc/kubernetes/manifests holds files which are the "manifest" section of a pod, and kubelet will run those as if it had been scheduled to do so.
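For illustration, a minimal static-pod manifest dropped into that directory might look like the following. This is a hedged sketch: the file name and image are invented, and this uses the modern full-Pod YAML form (the 2014-era files used an older manifest format):

```yaml
# /etc/kubernetes/manifests/node-agent.yaml -- illustrative only.
# kubelet watches this directory and runs the pod as if scheduled.
apiVersion: v1
kind: Pod
metadata:
  name: node-agent
spec:
  containers:
  - name: agent
    image: example.com/node-agent:latest   # hypothetical image
```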


@bketelsen
Contributor

Interesting. I'll poke around with that concept too. Sounds like it would solve this use case well enough.

@bgrant0607 bgrant0607 added the area/api Indicates an issue on api area. label Sep 30, 2014
@bgrant0607
Member

This request sounds more like a case for a custom auto-scaler than anything else, though some other features would be useful, such as per-attribute limits (discussed in #367 (comment)).

Most such agents that have been discussed do want host ports, though I understand we want to get rid of host ports. If we did eliminate host ports, we'd need an alternative discovery mechanism; I don't think we want to use existing k8s services. Even for the file-based approach, we probably need that. In #386, I proposed that we represent such services in /etc/hosts within containers. We could give them local magic IPs.

@thockin
Member

thockin commented Oct 1, 2014

Or we could add host networking


@brendandburns
Contributor

I agree with Tim. We should use manifest files on the host nodes to achieve this.

Brendan

@jbeda
Contributor Author

jbeda commented Oct 1, 2014

The big thing with manifest files is that the resultant pods aren't named/tracked by Kubernetes. As we build GUI visualizations of what is running, these won't show up.

Also, installing/distributing manifest files requires an out-of-band management system. It would be great if, once k8s was bootstrapped, there was one system for managing all work.

If, instead, we had the kubernetes master/API track local pods in a read-only way, that might help...

@thockin
Member

thockin commented Oct 1, 2014

Always with the caveat that we're all ears if someone has a use case that CAN NOT be satisfied this way :)


@erictune
Member

erictune commented Oct 1, 2014

I predict that some significant fraction of K8s cluster owners will want to run a control loop that automatically adds nodes to a K8s cluster based on a demand signal, such as pending pods. The approach of "setting replication count > nodes" won't work well with that.

@erictune
Member

erictune commented Oct 1, 2014

I agree with jbeda that we would want these locally-configured pods to have useful names and to show up in visualizations. And I agree with his suggestion that we should track them as "read-only" pods. We have internal experience suggesting this works.

I think the "read-only pod" solution is hard to avoid. People are going to think about their (physical or virtual) machines in terms of their raw capacity. But then some amount of resources will be taken out of that total by kernel memory and root-namespace files. And then there will be some daemons that people won't want to start using kubernetes, such as sshd (for emergency debugging of kubelet problems) and kubelet itself (for starting pods in the first place). Those need memory and CPU too. Once we solve exporting and visualizing that information, we are most of the way to generally handling locally-configured pods.

@thockin
Member

thockin commented Oct 1, 2014

Setting the replication count to "inf" could work, but it still feels like a hack to me.


@thockin
Member

thockin commented Oct 1, 2014

I think the master should become aware of pods that are running but that it did not create, and manage them in a read-only mode as you suggest. This is more or less how internal stuff works.


@bgrant0607
Member

Replication controller doesn't auto-scale on its own.

@dchen1107
Member

I filed #490 a while back to track all pods, including kubelet and other daemons created by manifest files. Pods created through manifest files have a separate, reserved namespace appended to the pod name. I don't see any potential issues with read-only mode.

@bgrant0607
Member

Filed #1523 for the more specific issue of representing such pods in the apiserver/etcd.

@sdake

sdake commented Oct 17, 2014

My thoughts on this issue. It is critical for our application to be able to run on every minion without necessarily having to modify the host filesystem to do so.

The specific use case is a project to run OpenStack on top of k8s (http://github.com/stackforge/kolla). This upstream project wants a defined non-hacky way to run a libvirt container and nova-compute container in 1 pod on every minion to provide virtual machine services via OpenStack. Without such a feature, it is impossible to make OpenStack actually run on top of k8s without installing 400+ packages in the host OS. Essentially it would erase any gains containerizing our two containers (1 pod) would provide.

I think the hostport hack would be acceptable, by setting replicationcount to 2^32 or something similar. But as of yet I haven't got this to work. We want to manage all OpenStack services through the kube-apiserver process, rather than having to manually modify the host filesystem, as this creates more complex deployment models. In some cases, manually modifying the host filesystem is extremely difficult for us, especially in the case of something like Atomic, a RHEL7-based operating system without a package manager.

I am hopeful we can come to agreement that this feature is helpful and doesn't add much in the way of complexity or scope creep.

Regarding the mention that adding this feature would result in more complexity, I believe the existing solutions are either hacky (hostport) or more complex (kubelet config file). In the case of the kubelet config file, there is no single management interface; instead it requires two methods of interfacing with the system.

Regards
-steve

@thockin
Member

thockin commented Oct 17, 2014

I disagree that kubelet config files are more complex, but reasonable people can disagree.

As for the host port hack, it is a hack, and if you set 2^32 replicas it will actually try to create and schedule 4 billion pods, failing on all but a tiny fraction of them, and then it will retry that periodically. A really bad idea. :)

Certainly you have SOMETHING managing your host machine filesystems? We have found, internally, that it is way easier to manage one-per-machine jobs as config files.

That said, I am not STRONGLY against this idea. Looking forward to some discussion.

@erictune
Member

Custom autoscaler seems like the way to go.

  • Have an object just like a replication controller, except that instead of a replicas count field, you have a node-selector (which would select all nodes in your case).
  • The controller manager watches for new minions and creates pods constrained to them.
  • Use hostname constraints.
  • Garbage-collect pods after 1 day of not seeing a node, or something.
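The control loop sketched above can be expressed as a small reconcile function. This is an illustrative sketch only (the function, its inputs, and the `GRACE_SECONDS` constant are all invented here; it is not real Kubernetes code):

```python
# Sketch of the per-node controller loop: keep exactly one daemon pod per
# selected node, and garbage-collect pods whose node has been gone ~1 day.

GRACE_SECONDS = 24 * 60 * 60  # assumed 1-day garbage-collection window

def reconcile(nodes, pods, last_seen, now):
    """Return (nodes_needing_pods, pods_to_delete).

    nodes:     set of node names currently matching the node-selector
    pods:      dict of node name -> daemon pod name that this controller owns
    last_seen: dict of node name -> timestamp the node was last observed
    now:       current timestamp
    """
    # Create a pod on every selected node that lacks one.
    create = sorted(n for n in nodes if n not in pods)
    # Delete pods bound to nodes that vanished more than GRACE_SECONDS ago.
    delete = [pods[n] for n in sorted(pods)
              if n not in nodes and now - last_seen.get(n, now) > GRACE_SECONDS]
    return create, delete
```

Running this on every node/pod watch event converges the cluster toward one daemon pod per matching node.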

@lavalamp
Member

We need to add constraints to the scheduler to make this work.

@bgrant0607
Member

@erictune's proposal is along the lines of what I was thinking. Definitely NOT a replication controller with an infinite count.

Rather than a hostname constraint, we could expose a node parameter on POST to /pods. Since pods don't reschedule, this constraint doesn't need to be part of the pod spec. This would also enable use of pod templates without getting into field overrides. Internally, we could add it to scheduling constraints, if we chose, but one could potentially also just bypass the scheduler in this case, so long as the apiserver verified feasibility, which is necessary if we want to support multiple schedulers, anyway.
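For context, the pod API that eventually emerged does expose such a binding: setting `spec.nodeName` assigns a pod directly to a named node, bypassing the scheduler. A hedged sketch (pod and node names invented):

```yaml
# Illustrative only: pin a pod to a specific node via spec.nodeName.
apiVersion: v1
kind: Pod
metadata:
  name: daemon-pod
spec:
  nodeName: node-1   # kubelet on node-1 runs this pod; no scheduler involved
  containers:
  - name: agent
    image: example.com/agent:latest   # hypothetical image
```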

The main thing to worry about is races: daemons getting evicted due to missing nodes, then not being the first thing to schedule back on the nodes when they become healthy again. To prevent this, we'd almost certainly need to add forgiveness (#1574).

As for whether we should support this feature:

  • On one hand, it's another feature, so it adds complexity.
  • On the other hand, we likely want to reduce our reliance on external configuration management systems, especially for application deployment, of which this is one example.

The core functionality seems pretty isolated from everything else, and what it needs are things we'd like to add for other reasons. Maybe it could be implemented as a plugin, which would also make it easier to rip out if we decided it was a bad idea. I'm in favor, but can't see it as high priority at the moment since there is a workaround.

@bgrant0607
Member

cc @AnanyaKumar

@bgrant0607
Member

@ravigadde Please comment regarding whether #10210 satisfies your use case.

@davidopp
Member

I like the idea of having the node controller schedule daemon pods. Putting in one place all the logic that has to run before a node is considered ready to accept regular pods makes a lot of sense.

But I'm not sure why you're mixing Deployment into this. It seems that just as we built ReplicationController as one component and plan to layer Deployment on top to allow declaratively describing state of the whole cluster and orchestrating rolling updates, we can build DaemonController as one component (that is mostly identical to ReplicationController except it uses % of nodes where ReplicationController uses # of replicas) and then layer Deployment on top to do those other things. The only tricky part is ensuring the % maps consistently, but IIRC you and @lavalamp had some ideas on how we could do that. Anyway, we can discuss that on the other issue.

@davidopp
Member

(To clarify my previous comment, when I refer to DaemonController I'm talking about logic, not necessarily a separate component; as you said, we could put the logic in the node controller).

@ravigadde
Contributor

@bgrant0607 Thanks for your comments. I will add my comments to #10210

+1 for adding this logic to node controller

@bgrant0607 bgrant0607 changed the title Feature: run-on-every-node scheduling/replication (aka per-node controller or daemon controller) Daemon (was Feature: run-on-every-node scheduling/replication (aka per-node controller or daemon controller)) Aug 27, 2015
@mikedanese
Member

I think this is done and we can close. Thanks @AnanyaKumar! Still to do (and to discuss):

  • handle daemonset podtemplate update
  • fold daemonset controller into nodecontroller
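The feature shipped as DaemonSet. For reference, a minimal manifest in the current `apps/v1` API looks roughly like the following (names and image are invented):

```yaml
# Illustrative minimal DaemonSet: one agent pod on every node.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent
spec:
  selector:
    matchLabels:
      app: node-agent
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      containers:
      - name: agent
        image: example.com/node-agent:latest   # hypothetical image
```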

@zhangxiaoyu-zidif
Contributor

I have a question: when I create a DaemonSet and a node's available memory does not meet the pod's memory request, will the DaemonSet start its pod later, once enough memory becomes available? That is, if the pod was not created while memory was insufficient, will the DaemonSet's pod start when memory becomes sufficient?

bertinatto pushed a commit to bertinatto/kubernetes that referenced this issue Apr 12, 2023
…e-disable

UPSTREAM: <carry>: OCPNODE-1548,OCPNODE-1584: disable load balancing on created cgroups when managed is enabled