Describe the bug
When incoming HTTP requests result in a requested data volume that exceeds the host's upload bandwidth, response bodies from the origin are buffered and queued to be copied out to the network. Meanwhile, new requests keep arriving and getting proxied, the origin responds to them quickly, and their responses join the queue to be copied out. This continues without limit. It can be seen in the cloudflared_tunnel_concurrent_requests_per_tunnel metric, which goes from zero to over 10k.
Consequences:
Cloudflared resident memory keeps growing. I've seen it go from several hundred MB to several gigabytes in minutes.
It probably also consumes an ever-increasing number of goroutines, and perhaps stacks, buffers, and file descriptors.
Eventually, clients that have been waiting for longer than 30 seconds start timing out and disconnecting, resulting in the agent logging a flood of "stream 3052609 canceled by remote with error code 0" errors.
After a few minutes, the agent starts experiencing I/O errors on connections with the origin, e.g. "read tcp [::1]:34464->[::1]:443: read: connection reset by peer". This is probably due to the memory and resource pressure on the host OS.
In our particular case, cloudflared logs are shipped to the cloud (Datadog), and the high log volume (due to errors) exacerbates the upload bandwidth problem.
The host becomes sluggish, and eventually unresponsive to remote management attempts.
To Reproduce
This was experienced with Cloudflare Tunnel in multiple customer environments. We have a cloud-based process that requests a steady stream of data from customers' hosts. Customer sites that have ample bandwidth do fine. Sites with constrained upload bandwidth experience this problem.
For the full details of several incidents, please refer to Cloudflare Support ticket 3133620.
Expected behavior
Cloudflared should provide an optional mechanism to protect itself (and the host) from this problem. One proposal is a new setting that limits the number of outstanding requests, which is already tracked with the cloudflared_tunnel_concurrent_requests_per_tunnel metric. When the limit is exceeded, the agent could respond to new requests with HTTP status 429 ("too many requests"). We implemented and successfully tested this in this PR in a forked repo: spotai#1
Environment and versions
OS: Linux / Ubuntu 22.04
Architecture: amd64
Version: 2024.2.1
So, a textbook case of missing network congestion control. A userspace tool is unlikely to solve the problem; capping cloudflared_tunnel_concurrent_requests_per_tunnel will paper over a totally different issue.