Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Throttle bulkhead from the circuit breaker on state transition #247

Open
wants to merge 3 commits into
base: mkipper/global-circuit-breaker-simple-integer
Choose a base branch
from

Conversation

michaelkipper
Copy link

What

This PR attempts to address #244. It throttles the number of bulkhead tickets to 1 when the circuit opens, throttles to success_threshold when the circuit moves to half_open, and removes the throttle when the circuit finally closes.

Why

When the circuit is in an open state, we still allow all the workers that can acquire tickets to attempt the transition to half_open. When the circuit is half open, we allow all those workers to attempt the transition to closed, which can sap much needed capacity during a sustained outage.

Note: This PR is still a work in progress, and there are several corner cases that need to be addressed. I'm looking for a high-level review of the approach.

Issues

  • It's a little gross to couple the bulkhead and circuit breaker in this way, but I'm not sure how else to approach it.
  • Not sure how SEM_UNDO works when one process applies the throttle, and another removes it. Or what happens when the process that applied the throttle terminates. Needs to be thought through more.

cc: @Shopify/servcomm

@michaelkipper michaelkipper force-pushed the mkipper/global-circuit-breaker-throttle-half-open branch from 5e9b561 to 13a7241 Compare July 4, 2019 16:03
Copy link

@thegedge thegedge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you had any success with semsim to show the capacity before and after?

I'm also interested in how this behaves with a large error_timeout. In particular, we may have capacity improvements, but that would be at the expense for a slower recovery. In the worst case scenario, where all workers had their circuit open at the same time, n - 1 workers will wait (edit: at least) an additional error_timeout seconds after recovery. I think I got that right 🤔

void test_sliding_window()
{
semian_simple_sliding_window_shared_t window;
assert_le_int(sizeof(window), 4096, "window size is greater than a page");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about sysconf(_SC_PAGESIZE) instead of 4096?

@michaelkipper
Copy link
Author

Have you had any success with semsim?

No, I don't think this works at all...
SI_SEM_CONFIGURED_TICKETS is basically a bookkeeping value. The real value we want to modify is SI_SEM_TICKETS.
The problem is that at the point when we most likely want to reduce SI_SEM_TICKETS to 1, it's already close to zero. If we attempt to do a -throttle operation, that worker would block until the rest of the workers timed out on their requests. Given that we know we're opening the circuit, it would be awesome if we could terminate those requests immediately, but my brief research into this suggests that we'd have to take down the worker with a SIGTERM which is likely a very expensive operation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants