Kubelet taking large amounts of cpu in v0.19.3 #10451
Comments
@dchen1107 what traces should @r-tock collect before restarting the kubelet?
After snooping around a bit more, I suspect the stats pushing is the cpu gulper. Using sysdig on one of the nodes, I see that the core-utilization spikes (10 seconds apart) coincide with the stats push.
I restarted kubelet on 4 other nodes and cpu has been coming down. Without looking at the code, this feels like some piece of data is being accumulated, and it is getting costlier and costlier to push/process that data over time as the server continues to accumulate more of it.
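For reference, a rough sketch of the kind of sysdig session described above (the exact commands used were not posted in the thread, so these are just illustrative):

```
# Watch which processes are burning CPU, then capture the kubelet's syscall
# stream to correlate activity with the 10-second stats-push spikes.
sudo sysdig -c topprocs_cpu        # live view of top processes by CPU
sudo sysdig proc.name=kubelet      # raw syscall stream for the kubelet process
```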
I still have the one misbehaving node. Let me know how long I should keep it around.
pprof is enabled on port 10250. Try
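The suggested command was trimmed from the comment above; presumably it was along the lines of the standard Go pprof invocation against the kubelet's debug endpoints (the node address here is a placeholder):

```
# kubelet exposes the standard net/http/pprof endpoints on its port
go tool pprof http://<node-ip>:10250/debug/pprof/profile   # CPU profile (30s by default)
go tool pprof http://<node-ip>:10250/debug/pprof/heap      # heap profile
```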
No success using go 1.4.2 toolchain
Port 10250 is https, not http.
That gives an error which to me sounds like the server is not returning the full certificate chain...
Right, it's using a self-signed cert. I'm not sure if it's possible to convince go tool pprof to ignore certificates.
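A commonly used workaround in this situation (not spelled out in the thread, just a sketch) is to fetch the profile with curl while skipping certificate verification, and then point pprof at the saved file. Endpoint access may also require credentials depending on the cluster configuration.

```
# -k / --insecure skips verification of the self-signed cert
curl -k https://<node-ip>:10250/debug/pprof/heap -o kubelet-heap.pprof
# with the go 1.4-era toolchain, pprof wants the matching binary for symbolization
go tool pprof <path-to-kubelet-binary> kubelet-heap.pprof
```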
Does the other profile on port 10248 help?
@r-tock can you take a profile of the kubelets with and without the issue? Especially the ones you restarted. The CPU usage you mentioned seems to be coming from processing stats requests; I'm curious whether the requests temporarily ceased when you restarted. How many pods/containers are being run on the machine, btw? 25% usage is not out of line with our expected numbers for a more heavily used machine.
@vmarmol To answer your last question, we run one container on these nodes (apart from the standard GKE containers).
@r-tock -- I was chatting with @bradfitz last night and he said that it isn't currently possible to make go tool pprof skip certificate verification.
@roberthbailey ack. The top node is currently running hot at 85% cpu usage. Let me know if there is anything I can do to help with this bug.
This is a known issue: heapster collects /stats too frequently. PR #9809, which was merged for the v0.20 release, should mitigate the issue.
cc/ @yujuhong since she was measuring kubelet's and other daemons' resource usage based on the workload a while back.
I took some measurements a while ago, but I took them with heapster disabled, and the codebase has changed quite a lot since then. Overall, the cpu usage of kubelet and docker in the steady state (no pod addition/deletion) roughly increases with the number of pods until it reaches 30 pods on an n1-standard-4 node (after that, docker becomes the rate limiter). Any readiness check (liveness probe) would also increase the kubelet cpu usage.

I just created a cluster at HEAD using two n1-standard-1 nodes and will monitor the cpu usage overnight. So far the kubelet usage (reported by cadvisor) for the two nodes is 3%-14% (spike) and 1.5%-8%, respectively. The difference in cpu usage is caused by the distribution of add-on pods. This is still low compared with what was reported above: @r-tock's figure showed that the minimum cpu usage for kubelet is ~15%, excluding the spikes. @r-tock, here are some questions that I hope will help clarify the situation:
By the way, in addition to the /stats requests, the kubelet also syncs all pods every 10 seconds. The syncing usually causes a spike in cpu usage for both kubelet and docker. kubelet.log does not show the syncing messages if the log level is <4.
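If you want to see those sync messages, a minimal sketch of what that requires (the log path and the exact messages are version- and distro-dependent):

```
# run the kubelet with glog verbosity of at least 4, then follow its log
kubelet --v=4 <other flags>
tail -f /var/log/kubelet.log
```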
Answers
@r-tock I believe you are running GKE, which means you are on the v0.19 kubernetes release. For the v0.19 release, it is a known issue that heapster collects /stats too frequently, which contributes largely to the kubelet cpu spikes.
/ack. What is the ETA on the 0.20.1 release?
An update on kubelet's resource usage after letting the cluster run for >16 hours. As mentioned before, this is a lightly-loaded cluster with primarily add-on pods, plus an extra pause pod per node.
The cloud console reports a stable 11% cpu usage for the node, with steady incoming network traffic of ~24xKB. I don't see any dramatic increase in resource usage, but I will keep monitoring for a while.
Please also see:
What is the ETA for pushing this to GKE? We still have all our alerts firing and on-call restarting the kubelet twice a day, and we would like to be in a calmer state.
We are in the process of pushing 0.21 to GKE today and tomorrow for new clusters. We will then begin the process of upgrading existing clusters (likely Monday). If you would like to be one of the initial test upgrades (possibly even earlier than Monday), send me mail at bburns [at] google.com. Also, in the meantime, a workaround would be to disable the health check on the DNS pod (a sketch of the steps follows below). Basically:
edit the json to remove the health check
and see if that fixes things.
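The original commands were trimmed from the comment above; a rough sketch of the suggested workaround, where the replication-controller name and label selector are guesses for that era of GKE and should be checked against what the cluster actually reports:

```
# find the kube-dns replication controller (name varies, e.g. kube-dns-v3, kube-dns-v8, ...)
kubectl get rc --namespace=kube-system
kubectl get rc <kube-dns-rc> --namespace=kube-system -o json > kube-dns.json
# manually delete the "livenessProbe" block from kube-dns.json, then push it back
# (kubectl replace; on older releases this command was "kubectl update")
kubectl replace -f kube-dns.json --namespace=kube-system
# recreate the DNS pods so they come up without the probe (label selector may differ)
kubectl delete pods -l k8s-app=kube-dns --namespace=kube-system
```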
Thanks @brendandburns, we will take care of deploying the world (our world) tomorrow as soon as the release hits. Will try the health check fix.
@brendandburns updating kube-dns by removing the livenessProbe did not reduce the cpu; I had to manually restart the kubelet.
The cpu usage has been down since the last release, but for g1-small nodes it's hovering around 23%, so I would keep the investigation open to reduce that further. Also, the bandwidth usage from these nodes has been a steady 256KB/s.
Is this a node that is running DNS? There's a "bug" in older dockers that ...
Sorry for not including that info! It's running 1.0.1 and not running DNS. The node itself is running 8 pods: 1 is a redis pod, 1 is fluentd pushing to Elasticsearch, and the other pods are Java applications.
@PhilibertDugas, I'd like to know more about your cluster. Thanks!
@PhilibertDugas, thanks for the information and the graph. It'd be great if you could get the cpu profile on a node where kubelet has high cpu usage. You can get a 30-second cpu profile by running:
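The command itself was trimmed from the comment; the standard way to grab a 30-second CPU profile from the kubelet's pprof endpoint looks roughly like this (using curl -k because of the self-signed certificate discussed earlier; node address and binary path are placeholders):

```
curl -k "https://<node-ip>:10250/debug/pprof/profile?seconds=30" -o kubelet-cpu.pprof
go tool pprof <path-to-kubelet-binary> kubelet-cpu.pprof
```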
Here is the cpu profile: I tried it on another node and it looked completely different! Something looks way off with this one.
IIUC, it looks like cadvisor's ...
@PhilibertDugas, thanks a lot for the profile! Do other nodes exhibit high cpu usage as well?
It certainly looks like that's the case here. If cadvisor hasn't changed that much since kubernetes v1.0.1, this call is checking the aufs directory for the container. @PhilibertDugas, perhaps you can manually check whether running du on that directory takes a long time.
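A hypothetical way to run that manual check; the docker root and the mapping from container to aufs layer directory depend on the docker version and configuration, so treat the path as an assumption:

```
docker ps --no-trunc                                     # find the full container id
time du -sh /var/lib/docker/aufs/diff/<container-or-layer-id>   # time the directory walk
```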
The other nodes with different graphs do not exhibit any high cpu usage. I ran the suggested du over the aufs directory:
It took around 1 minute. If I run it again:
it seems much quicker, but that might be due to caching.
@PhilibertDugas, yes, the second attempt is probably faster due to disk buffering. cadvisor runs du against the container directories periodically. Is there anything i/o heavy running on the node, or is there any difference in the machine spec for this node?
I'm also getting kubelet with high CPU usage. I'm using the vagrant VMs, and after a while the VM kills/freezes the ssh connection to the point that I'm unable to reach the VM again.
@yujuhong, from what I see, there is nothing that seems abnormal. The machine has 8 cores, and I/O is ~15 transfers/s, ~15 writes/s. I compared the stats with another machine that is running the same configuration (Kubernetes 1.0.1, etc.) which doesn't have the high cpu issue, and only 1 metric seemed different: the context switching for the high-CPU node was ~15K context switches/s, while the other machine with lower cpu is hovering around 6-7K context switches/s. I will try and see if there's anything else differing between those two nodes!
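For anyone wanting to reproduce that comparison, one way to look at context-switch rates (system-wide and per-process); pidstat comes from the sysstat package and may not be installed by default:

```
vmstat 1 10                                  # "cs" column = context switches per second, system-wide
pidstat -w -p "$(pgrep -x kubelet)" 1 10     # voluntary/involuntary switches for the kubelet process
```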
@PhilibertDugas, your directory could just be deeper and take a longer time for du to run.
@yujuhong Thanks for this!
Ever since we spun up a new cluster, the cpu has been slowly climbing.
The cloud monitoring console is showing this climb
This shows the cpu usage increase over the course of 2 days.
After sshing into the node, I see that kubelet is taking most of the cpu.
Let me know if you would like traces to investigate this. I can easily restart the kubelet to solve this issue; however, I will wait to get you some traces before I do so.