
🐛 cloudflared grows without bounds when request rate exceeds upload bandwidth, destabilizing host #1205

Open
bcle opened this issue Mar 14, 2024 · 1 comment
Labels
Priority: Normal (Minor issue impacting one or more users) · Type: Bug (Something isn't working)

Comments

bcle commented Mar 14, 2024

Describe the bug
When incoming HTTP requests result in a requested data volume that exceeds the host's upload bandwidth, response bodies from the origin are buffered and queued to be copied out to the network. Meanwhile, new requests keep arriving and being proxied, the origin responds to them quickly, and their responses join the queue to be copied out. This continues without limit, and can be seen in the cloudflared_tunnel_concurrent_requests_per_tunnel metric, which goes from zero to over 10k.
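To illustrate the failure mode, here is a simplified sketch (not cloudflared's actual proxy code; the origin address is illustrative): each proxied request holds an origin response open and copies it toward the slow network, and nothing bounds how many such copies are in flight at once.

```go
// Simplified sketch of the failure mode, not cloudflared's actual proxy path.
package main

import (
	"io"
	"net/http"
	"sync/atomic"
)

// analogous to cloudflared_tunnel_concurrent_requests_per_tunnel
var inFlight atomic.Int64

func proxy(w http.ResponseWriter, r *http.Request) {
	inFlight.Add(1)
	defer inFlight.Add(-1)

	// The origin (often local) answers quickly...
	resp, err := http.Get("http://localhost:8081" + r.URL.Path)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	// ...but this copy only drains as fast as the upload link allows.
	// With constrained upload bandwidth, handlers pile up here: each one
	// pins a goroutine, its stack, buffers, and a connection, so resident
	// memory grows without bound as new requests keep arriving.
	io.Copy(w, resp.Body)
}

func main() {
	// net/http serves every request on its own goroutine; nothing here
	// caps how many proxy copies are in flight at once.
	http.ListenAndServe(":8080", http.HandlerFunc(proxy))
}
```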

Consequences:

  • Cloudflared resident memory keeps growing; I've seen it go from several hundred MB to several gigabytes within minutes.
  • It likely also consumes an ever-increasing number of goroutines, and with them stacks, buffers, and file descriptors.
  • Eventually, clients that have been waiting longer than 30 seconds start timing out and disconnecting, causing the agent to log a flood of "stream 3052609 canceled by remote with error code 0" errors.
  • After a few minutes, the agent starts experiencing I/O errors on connections to the origin, e.g. "read tcp [::1]:34464->[::1]:443: read: connection reset by peer", probably due to memory and resource pressure on the host OS.
  • In our particular case, cloudflared logs are shipped to the cloud (Datadog), and the high error-driven log volume exacerbates the upload bandwidth problem.
  • The host becomes sluggish and eventually unresponsive to remote management attempts.

To Reproduce

This was experienced with Cloudflare Tunnel in multiple customer environments. We have a cloud-based process that requests a steady stream of data from customers' hosts. Customer sites with ample upload bandwidth do fine; sites with constrained upload bandwidth experience this problem.

For the full details of several incidents, please refer to Cloudflare Support ticket 3133620.

Expected behavior

Cloudflared should provide an optional mechanism to protect itself (and the host) from this problem. One proposal is a new setting that limits the number of outstanding requests, which are already tracked by the cloudflared_tunnel_concurrent_requests_per_tunnel metric. When the limit is exceeded, the agent could respond to new requests with HTTP status 429 ("Too Many Requests"). We implemented and successfully tested this in a PR in a forked repo: spotai#1
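To make the proposal concrete, here is a minimal sketch of the kind of guard we have in mind (not the actual change from spotai#1): a semaphore caps the number of in-flight requests, and anything over the cap is answered immediately with 429 instead of being buffered. The handler name and limit value are illustrative.

```go
// Sketch of the proposed limiter; names and the limit value are illustrative.
package main

import "net/http"

// limitConcurrent rejects new requests with 429 once maxInFlight requests
// are already outstanding, instead of letting them queue up unboundedly.
func limitConcurrent(maxInFlight int, next http.Handler) http.Handler {
	sem := make(chan struct{}, maxInFlight)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}: // a slot is free: handle the request
			defer func() { <-sem }()
			next.ServeHTTP(w, r)
		default: // at the limit: shed load instead of buffering
			http.Error(w, "too many concurrent requests", http.StatusTooManyRequests)
		}
	})
}

// proxyHandler stands in for the real origin-proxying logic.
func proxyHandler(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte("proxied response\n"))
}

func main() {
	// e.g. a configurable cap of 1000 outstanding requests per tunnel
	http.ListenAndServe(":8080", limitConcurrent(1000, http.HandlerFunc(proxyHandler)))
}
```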

Environment and versions

  • OS: Linux / Ubuntu 22.04
  • Architecture: amd64
  • Version: 2024.2.1
crrodriguez (Contributor) commented

So, a textbook case of no network congestion control. A userspace tool is unlikely to solve the problem; limiting on cloudflared_tunnel_concurrent_requests_per_tunnel will paper over a totally different issue.
