
🐛 cloudflared grows without bounds when request rate exceeds upload bandwidth, destabilizing host #1205

Open
bcle opened this issue Mar 14, 2024 · 1 comment
Labels
Priority: Normal (Minor issue impacting one or more users) · Type: Bug (Something isn't working)

Comments

bcle commented Mar 14, 2024

Describe the bug
When incoming HTTP requests result in a requested data volume that exceeds the host's upload bandwidth, response bodies from the origin are buffered and queued to be copied out to the network. Meanwhile, new requests keep arriving and being proxied, the origin responds to them quickly, and their responses join the queue to be copied out. This continues without limit, and can be seen in the cloudflared_tunnel_concurrent_requests_per_tunnel metric, which goes from zero to over 10k.
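To illustrate the failure mode, here is a simplified sketch (not cloudflared's actual proxy code; the origin address is illustrative): each proxied request holds an origin response open and copies it toward the slow network, and nothing bounds how many such copies are in flight at once.

```go
// Simplified sketch of the failure mode, not cloudflared's actual proxy path.
package main

import (
	"io"
	"net/http"
	"sync/atomic"
)

// analogous to cloudflared_tunnel_concurrent_requests_per_tunnel
var inFlight atomic.Int64

func proxy(w http.ResponseWriter, r *http.Request) {
	inFlight.Add(1)
	defer inFlight.Add(-1)

	// The origin (often local) answers quickly...
	resp, err := http.Get("http://localhost:8081" + r.URL.Path)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	// ...but this copy only drains as fast as the upload link allows.
	// With constrained upload bandwidth, handlers pile up here: each one
	// pins a goroutine, its stack, buffers, and a connection, so resident
	// memory grows without bound as new requests keep arriving.
	io.Copy(w, resp.Body)
}

func main() {
	// net/http serves every request on its own goroutine; nothing here
	// caps how many proxy copies are in flight at once.
	http.ListenAndServe(":8080", http.HandlerFunc(proxy))
}
```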

Consequences:

  • Cloudflared resident memory keeps growing; I've seen it go from several hundred MB to several gigabytes within minutes.
  • It likely also consumes an ever-increasing number of goroutines, and with them stacks, buffers, and file descriptors.
  • Eventually, clients that have been waiting longer than 30 seconds start timing out and disconnecting, causing the agent to log a flood of "stream 3052609 canceled by remote with error code 0" errors.
  • After a few minutes, the agent starts experiencing I/O errors on connections to the origin, e.g. "read tcp [::1]:34464->[::1]:443: read: connection reset by peer", probably due to memory and resource pressure on the host OS.
  • In our particular case, cloudflared logs are shipped to the cloud (Datadog), and the high error-driven log volume exacerbates the upload bandwidth problem.
  • The host becomes sluggish and eventually unresponsive to remote management attempts.

To Reproduce

This was experienced with Cloudflare Tunnel in multiple customer environments. We have a cloud-based process that requests a steady stream of data from customers' hosts. Customer sites with ample upload bandwidth do fine; sites with constrained upload bandwidth experience this problem.

For the full details of several incidents, please refer to Cloudflare Support ticket 3133620.

Expected behavior

Cloudflared should provide an optional mechanism to protect itself (and the host) from this problem. One proposal is a new setting that limits the number of outstanding requests, which are already tracked by the cloudflared_tunnel_concurrent_requests_per_tunnel metric. When the limit is exceeded, the agent could respond to new requests with HTTP status 429 ("Too Many Requests"). We implemented and successfully tested this in a PR in a forked repo: spotai#1
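To make the proposal concrete, here is a minimal sketch of the kind of guard we have in mind (not the actual change from spotai#1): a semaphore caps the number of in-flight requests, and anything over the cap is answered immediately with 429 instead of being buffered. The handler name and limit value are illustrative.

```go
// Sketch of the proposed limiter; names and the limit value are illustrative.
package main

import "net/http"

// limitConcurrent rejects new requests with 429 once maxInFlight requests
// are already outstanding, instead of letting them queue up unboundedly.
func limitConcurrent(maxInFlight int, next http.Handler) http.Handler {
	sem := make(chan struct{}, maxInFlight)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}: // a slot is free: handle the request
			defer func() { <-sem }()
			next.ServeHTTP(w, r)
		default: // at the limit: shed load instead of buffering
			http.Error(w, "too many concurrent requests", http.StatusTooManyRequests)
		}
	})
}

// proxyHandler stands in for the real origin-proxying logic.
func proxyHandler(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte("proxied response\n"))
}

func main() {
	// e.g. a configurable cap of 1000 outstanding requests per tunnel
	http.ListenAndServe(":8080", limitConcurrent(1000, http.HandlerFunc(proxyHandler)))
}
```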

Environment and versions

  • OS: Linux / Ubuntu 22.04
  • Architecture: amd64
  • Version: 2024.2.1
crrodriguez (Contributor) commented

So, a textbook case of no network congestion control. A userspace tool is unlikely to solve the problem; limiting on cloudflared_tunnel_concurrent_requests_per_tunnel will paper over a totally different issue.
