Sometimes Liveness/Readiness Probes fail because of net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) #89898
Comments
/sig network |
/triage unresolved Comment 🤖 I am a bot run by vllry. 👩🔬 |
Are you sure the application doesn't hit its resource limits? In my case, the application starts fine, then the container starts using more and more resources until it hits the limit. After that the readiness probes fail. |
Yes, I am sure. I collected the metrics. |
/area kubelet |
Any update on this? We have been facing similar issues for a few weeks. |
This is the first such report I have seen. There's nothing obvious about why this would happen. It's possible kubelet is too busy and starved for CPU, and the probe happened to be the thing that got thrashed. How many pods are on this machine? How busy is it? It's possible the node itself is thrashing or something and OOM behavior is weird. Does dmesg show any OOMs? It's possible some other failure deep in kubelet is being translated into this. You could try running kubelet at a higher log level to get more details on what is happening. A lot of bugs have been fixed since 1.12, so we'd need to try to reproduce this and then try again on a more recent version. Is there any way you can help boil it down to a simpler reproduction? |
@thockin If you aren't able to handle this issue, consider unassigning yourself and/or adding the 🤖 I am a bot run by vllry. 👩🔬 |
@thockin For example, see the get pods output above. Also, this is happening for this particular deployment only; the other pods are running fine: "returns-console-558995dd78-pbjf8 1/1 Running 0 23h" |
We too are facing the same issue in our cluster. Kubernetes version 1.16.8, installed using Kubespray. In our case, the timeouts were not related to the application but to specific nodes in the cluster. In our 13-node cluster, nodes 1 to 4 had some kind of issue wherein the pods running on these nodes had random failures due to timeouts. We checked that there weren't any CPU or memory usage spikes either. P.S. We are using NodePort for our production use case. Is it possible that the NodePort setup cannot handle too many socket connections? |
I have no idea what might cause spurious probe failures. @derekwaynecarr have you heard any other reports of this? @tarunwadhwa13 are you saying PROBES failed (always on the same node) or user traffic failed? If you have any other information about what was going on with those nodes when the failures happen, it would help. Check for OOMs, high CPU usage, conntrack failures? |
@thockin Conntrack shows hardly 2 or 3 errors. Memory consumption is 60-65% per node. We just found that the timeouts were for almost all requests and not just probes. We added Istio recently to check connection stats and understand whether the behaviour was due to the application, but the findings were weird: Istio itself is now failing its readiness probe quite frequently - 157 failures in ~3 hours. I'd also like to add that Kubernetes is running in our datacenter, and since the iptables version is 1.4.21, --random-fully is not being used. But since all machines have the same configuration, we ruled out this possibility. |
I apologize for not having a lot of details to share, but I'd add my 2 cents. We recently upgraded from Istio 1.4.4 to 1.5.4 and started seeing the issues described by the OP - lots of liveness/readiness failures that were not happening before. It SEEMS like adding, say, a 20-second initial delay has helped in most cases. At this point we are still seeing it and are not sure what the root cause is. Running on EKS 1.15 (control plane) / 1.14 managed node groups. |
I'm afraid the only way to know more is to get something like a tcpdump from both inside and outside the pod, which captures one or more failing requests. Possible? |
I'm getting the same issue.
Liveness on an Nginx container looks like this:
liveness:
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3
  httpGet:
    path: /list/en/health/ping
    port: 80
UPD: The strange thing is that completely distinct deployments (the staging4, staging5, stagingN you can see above) - more than 10 deployments - fail at once. My possible problem might be in missing |
Having the same problem here.
livenessProbe:
  httpGet:
    path: /status
    port: 80
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 10
  failureThreshold: 3
The error occurs roughly 3-10 times a day:
The service operates normally and responds to /status in 5ms, though. Running on Azure Kubernetes Service |
I'm facing the same issue as well; increasing timeoutSeconds didn't help.
livenessProbe:
  httpGet:
    path: /ping
    port: http
  failureThreshold: 5
  initialDelaySeconds: 5
  periodSeconds: 20
  timeoutSeconds: 10
readinessProbe:
  httpGet:
    path: /ping
    port: http
  initialDelaySeconds: 5
  periodSeconds: 20
  timeoutSeconds: 10
Running on Kubernetes v1.16.7 on AWS, deployed via kops. |
I appreciate the extra reports. It sounds like something is really weird. I'll repeat myself from above: I'm afraid the only way to know more is to get something like a tcpdump from both inside and outside the pod, which captures one or more failing requests. Is that possible? Without that I am flying very blind. I don't see this experimentally and I'm not flooded with reports of this, so it's going to be difficult to pin down. If you say you see it repeatedly, please try to capture a pcap? |
@thockin I'll try to get a dump if I'm able to replicate this issue consistently, since it tends to happen randomly. Sorry, I'm relatively new to this 😅 |
Why is kubelet opening new connections for every probe? Shouldn't it re-use the connection when it can? Also: net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120 - isn't that the TIME_WAIT period? |
It's been that way since v1.2 (80287e3).
That is the conntrack wait period; it will disappear once the socket closes. The TIME_WAIT period for the socket is a constant of the kernel; however, users can override it per socket using the SO_LINGER option. I see 2 possible directions (a sketch of the second option follows below):
1. Reusing connections - that means we have to refactor all the probes (for HTTP I think that Go will handle it).
2. Setting SO_LINGER to 0 or a much lower value (this is probing, so the risk of losing un-ACKed data is not important) and decreasing the TIME_WAIT value. |
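For illustration only, a minimal Go sketch of what option 2 could look like - closing each probe connection with SO_LINGER set to 0 so the socket is reset on close instead of lingering in TIME_WAIT. This is Linux-specific, the target URL and timeout are placeholders, and it is not necessarily what the kubelet change ended up doing:

```go
package main

import (
	"net"
	"net/http"
	"syscall"
	"time"
)

// newProbeClient builds an HTTP client whose TCP connections are closed
// with SO_LINGER=0, i.e. close() sends an RST and the socket skips TIME_WAIT.
func newProbeClient() *http.Client {
	dialer := &net.Dialer{
		Control: func(network, address string, c syscall.RawConn) error {
			var serr error
			if err := c.Control(func(fd uintptr) {
				// Onoff=1, Linger=0: reset the connection on close and
				// discard any unsent data (acceptable for a probe).
				serr = syscall.SetsockoptLinger(int(fd), syscall.SOL_SOCKET,
					syscall.SO_LINGER, &syscall.Linger{Onoff: 1, Linger: 0})
			}); err != nil {
				return err
			}
			return serr
		},
	}
	return &http.Client{
		Timeout: 1 * time.Second, // placeholder; real probes use timeoutSeconds
		Transport: &http.Transport{
			DialContext:       dialer.DialContext,
			DisableKeepAlives: true, // one connection per probe, as today
		},
	}
}

func main() {
	resp, err := newProbeClient().Get("http://10.0.0.12/healthz") // placeholder pod IP and path
	if err == nil {
		resp.Body.Close()
	}
}
```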
Option 1 is the "right" answer, but may be harder. The probe code isn't my favorite... |
let me see how far I go with 1 |
I did 1 and 2 XD #115143 |
I have to leave 1 out; the results are very confusing and not deterministic. The way HTTP reuses connections differs between HTTP/1 and HTTP/2, and the Go stdlib also behaves differently for those protocols. I think that option 2 is the best option for probes: less disruptive, and the results I obtained locally are very promising. |
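For context on what option 1 (connection reuse) means in practice: with plain HTTP/1, Go's Transport already pools idle connections, so a single shared client would reuse the same socket across probe periods as long as the response body is drained. A rough sketch with placeholder values, assuming HTTP/1 and no TLS (not the kubelet code):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// One shared client: the HTTP/1 Transport keeps idle connections in a pool
// and reuses them, so repeated probes would not open a new socket each time.
// (HTTP/2 instead multiplexes streams over a single connection, which is
// part of why the two protocols behave differently here.)
var probeClient = &http.Client{
	Timeout: 1 * time.Second,
	Transport: &http.Transport{
		MaxIdleConnsPerHost: 2,
		IdleConnTimeout:     30 * time.Second,
	},
}

func probeOnce(url string) error {
	resp, err := probeClient.Get(url)
	if err != nil {
		return err
	}
	// The body must be fully read and closed, otherwise the connection
	// is not returned to the idle pool and cannot be reused.
	io.Copy(io.Discard, resp.Body)
	resp.Body.Close()
	if resp.StatusCode >= 400 {
		return fmt.Errorf("unhealthy: %s", resp.Status)
	}
	return nil
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println(probeOnce("http://10.0.0.12/healthz")) // placeholder target
		time.Sleep(10 * time.Second)                       // placeholder periodSeconds
	}
}
```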
Will you revisit HTTP reuse? |
That's a can of worms; the HTTP/1 vs HTTP/2 thing is complex. Let's let this soak and revisit once we are able to get some data about the impact of this change. |
FWIW - I found this article useful in explaining TIME-WAIT and potential |
I am still facing the same issue. |
Same here, still facing this issue. |
Please indicate which version you are facing the issue with, how many probes are configured on the node, and whether they are HTTP/TCP probes or exec probes. |
Hi, I'm facing this issue right now too, in one of my namespaces, after DigitalOcean did an automatic update to 1.24.16-do.0. It pulls the image successfully, but then crashes and throws: Liveness probe failed: Get "http://<my.ip>/actuator/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers) |
No, that would fail to detect a pod doing graceful shutdown, wouldn't it? Kubelet needs to know whether the pod is accepting new connections, not just whether it's continuing to process old connections. |
@aojea do you have an example of how to implement option 2 (since option 1 doesn't seem to be doable at this point)? |
it is already implemented in #115143 |
@aojea OK, so this is still happening for me on a 1.26 host; I have to confirm on 1.27. Do you know which release this made it into? It seems like there is an open PR to fix the HTTP reuse, which would address option 1. |
Looks like it was part of 1.27. I’ll upgrade that cluster this week. |
@cdenneen did you get a chance to upgrade, and did you see any improvement? |
Hello, we are seeing events with "reason": "Unhealthy" on several objects. The probe failure message contains the API server health output: all the [+]poststarthook/... checks report ok, but it ends with "healthz check failed") has prevented the request from succeeding status=500. There is also this log line: 2024-03-27 04:48:27.774 [INFO][1] resources.go 350: Main client watcher loop. |
What happened:
In my cluster the readiness probes sometimes fail, but the application works fine.
What you expected to happen:
Successful Readiness Probe.
How to reproduce it (as minimally and precisely as possible):
We have a few clusters with different workloads.
Only in the cluster with a big number of short-lived pods do we have this issue, and not on all nodes.
We can't reproduce this error on our other clusters (which have the same configuration but a different workload).
How I found the issue:
I deployed a daemonset:
Then I started to watch events on all pods of this daemonset.
After some time I received the following events:
Those events were on ~50% of the pods of this daemonset.
On the nodes where the pods with failed probes were placed, I collected the kubelet logs.
There were errors like:
I thought that sometimes the nginx in the pod really does respond slowly.
To rule out this theory, I created a short script that curls the application in the pod and stores the response time in a file:
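(The script itself is not included in the report. A rough equivalent of such a measurement loop, with the pod IP, path, interval, and output file as placeholders, could look like this:)

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

// Hit the pod's endpoint in a tight loop and append the response time to a
// file, to check whether the application ever answers slowly. The URL,
// interval, and file name below are placeholders, not values from the report.
func main() {
	out, err := os.OpenFile("/tmp/response-times.log",
		os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		panic(err)
	}
	defer out.Close()

	client := &http.Client{Timeout: 1 * time.Second}
	for {
		start := time.Now()
		resp, err := client.Get("http://10.32.0.5/")
		elapsed := time.Since(start).Seconds()
		if err != nil {
			fmt.Fprintf(out, "%s ERROR %v\n", start.Format(time.RFC3339Nano), err)
		} else {
			resp.Body.Close()
			fmt.Fprintf(out, "%s %d %.4f\n", start.Format(time.RFC3339Nano), resp.StatusCode, elapsed)
		}
		time.Sleep(2 * time.Millisecond)
	}
}
```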
I ran this script for 30 minutes on the node where the pod is placed, and got the following:
There were 482670 measurements and the longest response time was 0.007s.
In the logs of the pod there are only messages with response code 200 (from my requests and from the readiness probe requests):
It means that at least some of the probes are successful.
Then I stopped the curl script (because of the big number of logs).
I waited until a new failed-probe error appeared in the kubelet logs.
In the logs of that nginx pod I didn't find the request generated by this probe:
If I restart the kubelet, the error doesn't disappear.
Does anyone have any suggestions about this?
Environment:
- OS (e.g. from cat /etc/os-release): Ubuntu 16.04
- Kernel (e.g. uname -a): 4.15.0-66-generic
Seems like the problem appears in many different installations - #51096
/sig network