[DRAFT] Share circuit breakers between workers #227

Open · wants to merge 51 commits into base: implement_lru_cache from mkipper/global-circuit-breaker
Conversation


@michaelkipper commented Jun 12, 2019

This PR fixes #27.

What

Moves the implementation of circuit breakers to shared memory so circuit errors can be shared between workers on the same host.

Why

Currently, for single-threaded workers, we need to see error_threshold consecutive failures in order to open a circuit. For applications where timeouts are necessarily high (e.g. 25 seconds for MySQL) that translates into up to 75 seconds of blocking behaviour when a resource is unhealthy.

If all workers on a host experience the outage simultaneously, that information can be shared and the circuit can be tripped in a single timeout iteration, increasing the speed of detecting the unhealthy resource and opening the circuit.

In addition, in the case of a variable latency outage of an upstream resource, the collective hivemind can share information about the increased latency and detect trends earlier and more reliably. This feature is not implemented yet, but becomes possible after this change.

How

Note: This is very much a draft, and needs a lot of refactoring, but it's a start.

The main idea is to move the data for Simple::Integer to a semaphore and Simple::SlidingWindow to shared memory.

Simple::State simply uses the shared Simple::Integer as its backing store to easily share state between workers on the same host.
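
(For illustration only: one way the state could be encoded on that shared backing store. The enum values and helper names below are hypothetical, not the PR's actual API.)

/* Hypothetical encoding of Simple::State on top of the shared Simple::Integer. */
enum circuit_state {
  STATE_CLOSED    = 0,   /* zero-initialized storage therefore starts "closed" */
  STATE_OPEN      = 1,
  STATE_HALF_OPEN = 2,
};

/* simple_integer_get/set stand in for whatever reads and writes the shared value. */
static enum circuit_state
circuit_state_get(semian_simple_integer_t *backing)
{
  return (enum circuit_state)simple_integer_get(backing);
}

static void
circuit_state_set(semian_simple_integer_t *backing, enum circuit_state state)
{
  simple_integer_set(backing, (int)state);
}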

Feel free to take a look around, but I've still got some refactoring to do.

In particular, this implementation isn't thread-safe yet and requires some serious locking.

cc: @Shopify/servcomm
cc: @sirupsen, @csfrancis, @byroot

@thegedge left a comment

Some high-level first thoughts.

}

return (int*)val;
}

@name doesn't look to be changing after initialization. Could we cache the shmid in semian_simple_integer_t? How about the void *val?

Similarly for circuit_breaker.c and sliding_window.c.

@michaelkipper (Author)

Yes, I actually addressed this in a later commit.
Basically, we want to store the key generated from the name on the object itself.
I'll double check I've done this everywhere.
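
(A rough sketch of the caching being discussed, reusing the field and function names that appear elsewhere in this PR; the exact struct layout here is a guess.)

/* Cache the derived key and IPC handles at initialization so later calls
 * don't repeat shmget()/shmat() lookups on every operation. */
typedef struct {
  uint64_t key;      /* derived once from @name */
  int      sem_id;   /* semaphore guarding the shared value */
  void    *shmem;    /* shared memory segment, attached once */
} semian_simple_integer_t;

static void
semian_simple_integer_init_ipc(semian_simple_integer_t *res, const char *name)
{
  res->key    = generate_key(name);
  res->sem_id = initialize_single_semaphore(res->key, SEM_DEFAULT_PERMISSIONS);
  res->shmem  = get_or_create_shared_memory(res->key, &init_fn);
}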

@pushrax (Contributor) commented Jun 12, 2019

If we have this, I'm fairly sure we should also decouple bulkheads from circuits. At the moment (or at least as of the end of 2018), a bulkhead acquisition failure marks an error in the same resource's circuit. There's a lengthy discussion in https://github.com/Shopify/shopify/pull/178393. I'll reproduce the big points for the sake of open source. At the end, my conclusion was that sharing circuits between processes and decoupling bulkheads would be an attractive option, which is possible after this PR!

@ashcharles

If Semian is unable to acquire a ticket due to bulkheading, an error is thrown. Multiple errors may exceed the configured error_threshold for the resource causing the circuit to open. While this behaviour is tested in Semian, document it explicitly here with a test for the case of Redis to make it clear that repeated bulkheading on Redis may lead to open-circuit errors.

I found this behaviour surprising as did others in the ensuing discussion. I chose code > documentation. I thought that adding a test to core would make it clear for folks who aren't necessarily looking at the Semian code-base.

@jpittis

TLDR: I can't think of a reason (besides backward compatibility) why we wouldn't want to remove this behaviour from Semian.

This is a really good test case to articulate an important property of Semian.

Bulkheads triggering does not mean a protected service is experiencing any kind of issue. But we still open a circuit and fast fail requests for the duration of error timeout.

I suspect that this has caused us to fail requests to Redis (and other protected services) when the service was not under duress.

@pushrax

Thanks for bringing this up. I agree it may not be desired behaviour.

The high level goal of a circuit breaker is: "if a request is likely to fail, don't issue it." This is done for two reasons. First, and most importantly, it reduces unnecessary load on the dependency, which is particularly crucial when the dependency is failing due to overload. Second, it reduces unnecessary load on the dependent, which is crucial when the dependency is timing out and taking up too much dependent capacity.

The relevant property to notice is: a circuit breaker is not useful when the cost of a failed request is similar to the cost of evaluating the circuit breaker. When a bulkhead is full, further requests will fail without hitting the network at all, thus failure is cheap. Circuit breakers aren't providing their intended improvement when applied to bulkhead failure. I agree it makes sense to remove this behaviour, it complicates thinking about tuning. An oscillating distributed bulkhead is even more confusing to reason about than a regular distributed bulkhead.

@sirupsen

I am not sure I am sold on removing this behaviour. I've always thought the interaction between them was useful. If tickets are 50% of workers and timeouts are high you'll be "wasting" 50% of workers on timeouts to the resource. After 3 x error_threshold you'd be back to having 100% of capacity (if the half_open_resource_timeout is reasonable).

Imagine having a 25s timeout on MySQL. When all tickets are taken, clearly that MySQL is a problem and we trigger circuits rapidly. If you waited only for the circuits, you'd have to wait 3 x 25s. If you had only bulkheads, you're "wasting" 50% of capacity. If you do both, you seem to get the best of both worlds (as long as you have the half open timeout).

@pushrax

I think this is a case where both positions are correct depending on the tuning and scenario. Simulation would be a great way to show exactly how this is true.

I keep forgetting circuits are not shared between processes, while bulkheads are. Your idea ends up working around this by sharing state between processes through the bulkheads to influence the circuits.

Vertical axis is resource usage (in tickets), horizontal is time:
[figure omitted: resource usage (tickets) over time, marking the t0 interval, the breaker-open interval, and t1]

Assume t0 is the resource timeout. My earlier comment about tripping the circuit not being useful here was with reference to the overlapping time between the marked t0 interval and the breaker open interval. In that case, the bulkhead will prevent further requests anyway. In the remaining time (if any) of the breaker open interval, we avoid queuing more work. t1 is showing the "half open timeout" logic.

What about the case when t0 < timeout?

[figure omitted: the case when t0 < timeout]

When all tickets are taken, clearly that MySQL is a problem and we trigger circuits rapidly.

Downstream latency regression is one reason tickets can be exhausted, but upstream demand and small amounts of queuing can also cause this. If a large amount of load appears, filling tickets, we prematurely assume instability by shutting down the resource entirely. We fail requests that would not have failed.

When we trip circuits early based on bulkhead errors, we gain capacity for other work. If the other work can't use this capacity more effectively than the original work, we are making an error. This is the part that's difficult to understand: how likely is it that making capacity available for other work will have higher ROI given we just know a bulkhead is full? I don't have a good intuition for this.

Bulkheads can be tuned such that even when exhausted, there is sufficient capacity for other work, provided other competing bulkheads aren't concurrently exhausted. Is this not the way we use them?

Would sharing circuit state between processes be a more direct way of addressing the goals here without needing to understand the statistics of production failure modes?

@csfrancis (Contributor)

My big concern with this implementation is the number of shared memory segments that will be created. It looks like we're using distinct shared memory segments for each instance of Simple::Integer and Simple::SlidingWindow. For Shopify production, I could see this amounting to (tens of?) thousands of shared memory segments. I can't say for certain that this would be a problem, but it feels a bit sketchy to me and it would be good to verify that it's not.

@michaelkipper (Author)

My big concern with this implementation is the number of shared memory segments that will be created... For Shopify production, I could see this amounting to (tens of?) thousands of shared memory segments.

@csfrancis: The data I have for core suggests that it's hundreds, not thousands:
https://shopify.datadoghq.com/dashboard/mzg-id4-5rp/semian-lru?screenId=mzg-id4-5rp&screenName=semian-lru&tile_size=m&from_ts=1560376147986&to_ts=1560462547986&live=true&fullscreen_widget=812985505059370&fullscreen_section=overview

Our max_size LRU cache should put bounds on this.

@pushrax force-pushed the implement_lru_cache branch 2 times, most recently from db7215f to 0d55fb6 on June 14, 2019 20:26
@michaelkipper force-pushed the mkipper/global-circuit-breaker branch 2 times, most recently from c83159e to 1a46fb6 on June 18, 2019 20:54
@sirupsen (Contributor) left a comment

Did a first high-level pass. :)

You describe the problem in the opening comment, but not why you think or know this approach will work. Why do we think this will solve it (did simulations prove it)? What alternatives were considered? How will we prove whether it works?

README.md Outdated
sensitive.

To disable host-based circuits, set the environment variable
`SEMIAN_CIRCUIT_BREAKER_IMPL` to `ruby`.
Contributor

I think these are disabled by default for backwards compatibility, might be worth noting.

@michaelkipper (Author)

Done.

@@ -41,6 +42,24 @@ def test_reset
@integer.reset
assert_equal(0, @integer.value)
end

if ENV['SEMIAN_CIRCUIT_BREAKER_IMPL'] != 'ruby'
Contributor

An example of where you missed the ruby / worker duality.

@michaelkipper (Author)

Done. This test can run for both implementations.

this configuration, we can reduce the time-to-open for a circuit from _E * T_
to simply _T_ (provided that _N_ is greater than _E_).

You should run a simulation with your workloads to determine an efficient
Contributor

This is really tough advice to give someone. That can be days to weeks of work. Can we, with our simulation tooling, provide some sample values that we can include here? I imagine we can generate a somewhat magic constant based on simulations on (1) ss_error_rate steady-state error rates, (2) n number of Semian consumers (threads * workers), (3) error_threshold, (4) resource_timeout, (5) whatever other inputs you all use?

I'd like the advice to then be something along the lines of...

While your own simulation will likely find you a superior value for the scale factor, we recognize how much work this is. We've done simulations internally and have found that a scale factor that is `<magic constant> * <threads>` works fairly well and use that for almost all services.

@@ -64,4 +66,30 @@ void Init_semian()

/* Maximum number of tickets available on this system. */
rb_define_const(cSemian, "MAX_TICKETS", INT2FIX(system_max_semaphore_count));

if (use_c_circuits()) {
Init_SimpleInteger();
Contributor

Why did you choose to override them rather than define them and switch at the Ruby-layer instead? The latter seems a bit cleaner to me.

@michaelkipper (Author)

Mostly because that's roughly the way the bulkhead was implemented, and I wanted to keep the codebase familiar. One could argue that the Ruby implementation of the bulkhead is actually just a stub, but then those functions should have NotImplemented assertions in them.

Contributor

Fine with me. 👍


dprintf("Initializing simple integer '%s' (key: %lu)", to_s(name), res->key);
res->sem_id = initialize_single_semaphore(res->key, SEM_DEFAULT_PERMISSIONS);
res->shmem = get_or_create_shared_memory(res->key, &init_fn);
Contributor

Why did you decide on shared memory rather than the existing semaphore wrappers for storing this number? (No strong opinion, but curious).

@michaelkipper (Author)

There was a brief period where there was something else in the shared memory (I can't really remember now). There's no real issue with using a semaphore, and it would get rid of the need for locks, assuming we could guarantee that the operations would not block.

Contributor

I'll think more about this when I review the actual code and not just the approach.

}

VALUE
semian_simple_sliding_window_push(VALUE self, VALUE value)
Contributor

Many other circuits just use a number and not a sliding window. What are the advantages of either approach? One huge advantage would be that you could just use the Integer class instead of a 450 line sliding window implementation.

@michaelkipper (Author)

To do circuit breaking without a sliding window, you'd have to use fixed time intervals for error counts, and reset the counter at the end of each fixed-time interval. There are several issues with this, off the top of my head:

  • That's a breaking change, compared to the current implementation.
  • You don't open circuits as fast, if a series of failures straddles a reset boundary. There can be cases when the circuit never opens at all (which you could argue is a benefit).
  • A master/slave election might be necessary to determine which worker resets the counter, which could be more complex than the sliding window implementation.

This approach is admittedly more complex than a single integer counter but I'm not sure that it's worth sacrificing the benefits of a sliding window.

It's hard to reason about the effect of this without millisecond-accurate production data.
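
(To make the comparison concrete, here is a rough sketch of a fixed-capacity ring buffer living in the shared segment. The field names are illustrative, not the PR's actual layout, and the caller is assumed to hold the semaphore lock.)

#define SLIDING_WINDOW_MAX_SIZE 120   /* illustrative capacity */

typedef struct {
  int max_size;                        /* configured window size */
  int length;                          /* entries currently stored */
  int start;                           /* index of the oldest entry */
  int data[SLIDING_WINDOW_MAX_SIZE];   /* e.g. error timestamps */
} sliding_window_shared_t;

static void
sliding_window_push(sliding_window_shared_t *w, int value)
{
  if (w->length == w->max_size) {
    /* Window full: drop the oldest entry before appending. */
    w->start = (w->start + 1) % w->max_size;
    w->length--;
  }
  int end = (w->start + w->length) % w->max_size;
  w->data[end] = value;
  w->length++;
}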

@sirupsen (Contributor) · Jun 26, 2019

Just to ensure we're on the same page, the alternative approach I'm considering is that you have two integers: (1) Number of failures, (2) Timestamp of when to reset the number of failures. This makes the code significantly simpler (in Ruby, it's a few lines, but it's quite a bit more here).
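
(A rough sketch of that two-integer alternative, under the assumption that the reset rule is "clear the count once error_timeout has elapsed"; the names are illustrative.)

#include <time.h>

typedef struct {
  int    failures;   /* errors seen in the current interval */
  time_t reset_at;   /* when to start a fresh interval */
} fixed_interval_counter_t;

static int
record_failure(fixed_interval_counter_t *c, time_t now, int error_timeout)
{
  if (now >= c->reset_at) {
    /* Interval elapsed: start a fresh count. */
    c->failures = 0;
    c->reset_at = now + error_timeout;
  }
  return ++c->failures;   /* caller compares this against error_threshold */
}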

I changed this a few years ago in #24. It's actually addressing a separate point, which was false positives. I think the reason why you didn't arrive at that conclusion was that the previous solution would refresh the TTL every time it was modified.

Overall, likely the way we should look at circuits (I believe @hkdsun suggested this? Or maybe @csfrancis? Not my idea) is to talk about an "error percentage", i.e. if 10% of requests were errors in some window of time, open the circuits. That model is compatible with both, I think.

Again, I have not reviewed the code in detail here (just the approach), but it's a lot of code to maintain for what might appear to be a somewhat marginal advantage. I think my concern from 2015 was more a matter of how easy it was in that approach than real value.

@michaelkipper (Author)

@pushrax suggested this in our shared doc as well.

My primary motivation for host-based circuits is to address the variable latency problem, where a single worker doesn't see error rates above a threshold but the superposition of all errors on a host shows an anomaly.

I'm not sure it works without some sort of window, because in the implementation before #24, previous errors were only purged after a period with no new errors since the last observed error. If new errors arrive more often than once per error_timeout, the pseudo-window grows without bound and the circuit eventually opens. So if I'm reading that code correctly, the circuit opens quickly when there are a lot of errors, slowly-but-surely when there are some errors, and only remains closed when there are fewer than one error per error_timeout.

Using that sort of approach in a host-based circuit implementation likely regresses to the slowly-but-surely case, where the circuit constantly opens at a low frequency. Combined with capacity loss during the half_open_resource_timeout period, this would be an overall drop in available capacity.

@michaelkipper michaelkipper force-pushed the mkipper/global-circuit-breaker branch from e58b147 to d16c6ab Compare June 26, 2019 21:12
@michaelkipper (Author)

@csfrancis has some legitimate concerns about the number of shared memory segments this PR would create. SHMMNI is 4,096 on Linux so with our LRU max_size of 500, we'd need 1,500 of them.

Scott suggested storing all the sliding windows in a single shared memory segment, but I have concerns with this approach, specifically the complexity that would be required to manage indexing into that data structure. Right now, we generate a key and the IPC subsystem handles the lookup.

Another difficulty is cleanup. Given that we set IPC_RMID and call shmdt when a sliding window goes out of scope, we can defer cleanup to the kernel. If we had a single memory segment, we'd have to manage all that ourselves.

For SimpleInteger, the size of a shared memory segment on Linux is at least one page (4kB) so storing SimpleInteger in a shared memory segment is extremely inefficient. SEMMNI is 32,000 on Linux, so storing SimpleIntegers as a semaphore is a reasonable choice (as @sirupsen questioned earlier). I'll try and implement that ASAP.
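
(A minimal sketch of a semaphore-backed Simple::Integer, assuming plain SysV calls. Note that semaphore values cannot go negative, so a decrement that would cross zero blocks unless IPC_NOWAIT is set — the non-blocking guarantee mentioned earlier.)

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Linux requires the caller to define this union for semctl(SETVAL). */
union semun { int val; struct semid_ds *buf; unsigned short *array; };

static int
simple_integer_create(key_t key)
{
  return semget(key, 1, IPC_CREAT | 0660);   /* -1 on failure; real code would raise */
}

static int
simple_integer_increment(int sem_id, int by)
{
  /* semop() adjusts the value atomically across all attached processes. */
  struct sembuf op = { .sem_num = 0, .sem_op = (short)by, .sem_flg = 0 };
  return semop(sem_id, &op, 1);
}

static int
simple_integer_value(int sem_id)
{
  return semctl(sem_id, 0, GETVAL);
}

static int
simple_integer_reset(int sem_id)
{
  union semun arg = { .val = 0 };
  return semctl(sem_id, 0, SETVAL, arg);
}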

@michaelkipper (Author)

@csfrancis was concerned last week that the host-based circuits PR would require too many shared memory segments. The default on Linux is 4,096 and we were going to be using somewhere in the neighborhood of 1,500 with a max_size LRU of 500. He suggested using a single shared memory segment for all the circuit data.

I took a long stab at it, but ultimately concluded that it was too hard to do garbage collection. With semaphores, it's easy to do garbage collection with SEM_UNDO (in SysV IPC). With shared memory, we can set IPC_RMID to have the segment be destroyed after the last process detaches from it. But with a single segment for all the circuits, we'd have to manage garbage collection within the segment ourselves. I don't think it's impossible, but it adds a large amount of complexity to an already complex PR.

To alleviate concerns about the number of shared memory segments, I converted the Simple::Integer to use a semaphore, since there are 32k semaphore sets available, which leaves the number of shared memory segments required at 500 since only the SlidingWindow is using them.

As far as shipping this is concerned, I have https://github.com/Shopify/shopify/pull/206065 to bump core to use the new branch, with the previous behaviour. Then #243 adds support for an environment variable SEMIAN_CIRCUIT_BREAKER_FORCE_HOST which enables host-based circuits based on machine name (so we can enable it on a single node). Once we validate the behaviour on that node, we can move to a percentage rollout.

@csfrancis (Contributor) left a comment

I'm not finished reviewing this, but I'm leaving the comments that I've left so far.

wait_for_shared_memory(uint64_t key)
{
for (int i = 0; i < RETRIES; ++i) {
int shmid = shmget(key, SHM_DEFAULT_SIZE, SHM_DEFAULT_PERMISSIONS);
Contributor

This retry logic is interesting - from looking at the docs, is there a case where shmget will fail and you expect that the retry will succeed?

@michaelkipper (Author)

The first try uses IPC_CREAT | IPC_EXCL. If that fails, then we assume it's because another process has created the segment. I think my intention was to check whether that other process had run the shared_memory_init_fn, but I've tried to optimize that out by building data structures that reset to zeroed-out memory.

I removed the wait loop - we can add that feature if we need it.
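
(A sketch of the resulting create-or-attach flow without the retry loop, plus the IPC_RMID/shmdt cleanup described earlier; error handling and raising into Ruby are omitted. SHM_DEFAULT_SIZE and SHM_DEFAULT_PERMISSIONS are as defined in the PR's headers.)

#include <sys/ipc.h>
#include <sys/shm.h>
#include <stddef.h>

static void *
get_or_attach_shared_memory(key_t key, int *out_shmid)
{
  /* Try to create the segment exclusively first. */
  int shmid = shmget(key, SHM_DEFAULT_SIZE,
                     IPC_CREAT | IPC_EXCL | SHM_DEFAULT_PERMISSIONS);
  if (shmid == -1) {
    /* Another worker presumably created it already, so just look it up. */
    shmid = shmget(key, SHM_DEFAULT_SIZE, SHM_DEFAULT_PERMISSIONS);
    if (shmid == -1) return NULL;
  }
  /* A freshly created segment is zero-filled by the kernel, so structures
   * whose initial state is all zeroes need no explicit init function. */
  void *val = shmat(shmid, NULL, 0);
  if (val == (void *)-1) return NULL;
  if (out_shmid) *out_shmid = shmid;
  return val;
}

/* When the Ruby object goes out of scope: mark the segment for destruction
 * and detach; the kernel frees it after the last process detaches. */
static void
release_shared_memory(int shmid, void *val)
{
  shmctl(shmid, IPC_RMID, NULL);
  shmdt(val);
}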

// or else sem_get will complain that we have requested an incorrect number of sems
// for the desired key, and have changed the number of semaphores for a given key
const int NUM_SEMAPHORES = 4;
sprintf(semset_size_key, "_NUM_SEMS_%d", NUM_SEMAPHORES);
Contributor

I know you didn't add this, but this logic is just weird to me. If NUM_SEMAPHORES is a constant, why are we using sprintf here? Couldn't this be simplified to:

char semset_size_key[] = "_NUM_SEMS_4";
...
uniq_id_str = malloc(strlen(name) + strlen(semset_size_key) + 1);
sprintf(uniq_id_str, "%s%s", name, semset_size_key);

Should also probably be checking for malloc failures and raising.
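
(For example — illustration only, with the exception class as a placeholder:)

char semset_size_key[] = "_NUM_SEMS_4";
...
uniq_id_str = malloc(strlen(name) + strlen(semset_size_key) + 1);
if (uniq_id_str == NULL) {
  rb_raise(rb_eRuntimeError, "could not allocate semaphore key for '%s'", name);
}
sprintf(uniq_id_str, "%s%s", name, semset_size_key);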

static VALUE
resize_window(int sem_id, semian_simple_sliding_window_shared_t* window, int new_max_size)
{
if (new_max_size > SLIDING_WINDOW_MAX_SIZE) return Qnil;
Contributor

Should this be raising?

@michaelkipper (Author)

Done.
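
(Roughly, the resolved guard might look something like this — illustration only:)

/* Raise instead of silently returning Qnil when the requested size is too large. */
if (new_max_size > SLIDING_WINDOW_MAX_SIZE) {
  rb_raise(rb_eArgError, "sliding window size %d exceeds maximum %d",
           new_max_size, SLIDING_WINDOW_MAX_SIZE);
}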


int sem_id = semget(key, 1, permissions);
Contributor

Does this produce a warning? I thought that key_t was a 32-bit value and you've changed the type to 64 bits.

@michaelkipper (Author)

I didn't see one, and I compile with -Wall.
All the values are generated with:

key_t generate_key(const char *name);

I was following what was done in semian_resource_t.
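
(For context, one way generate_key might derive a key_t from the resource name; the FNV-1a hash below is illustrative — Semian's actual hash may differ, and collision/zero-key handling is out of scope here.)

#include <stdint.h>
#include <sys/types.h>

key_t
generate_key(const char *name)
{
  uint64_t h = 0xcbf29ce484222325ULL;   /* FNV-1a 64-bit offset basis */
  for (const char *p = name; *p; p++) {
    h ^= (uint8_t)*p;
    h *= 0x100000001b3ULL;              /* FNV-1a 64-bit prime */
  }
  /* key_t is typically a 32-bit int, hence the truncation discussed above. */
  return (key_t)(h & 0x7fffffff);
}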
