Allow clients to subscribe to slot migrations #298

zuiderkwast · 2024-04-11T15:53:14Z

IMHO, this is the most important event for cluster clients to be able to subscribe to. Why? In a scaling-and-balancing scenario, many slots can be moved. They are moved one by one. If a client performs a slot mapping update (e.g. calls CLUSTER SLOTS) every time it receives a -MOVED redirect, it will need to do this many times if it gets a MOVED redirect after each migrated slot.

Updating the slot mapping at every MOVED redirect is a recommended behavior according to the cluster spec, so clients do that:

> An alternative is to just refresh the whole client-side cluster layout using the CLUSTER SHARDS, or the deprecated CLUSTER SLOTS, command when a MOVED redirection is received. When a redirection is encountered, it is likely multiple slots were reconfigured rather than just one, so updating the client configuration as soon as possible is often the best strategy.

At least for long-lived connections, this is a useful thing to do.

The cluster bus doesn't distinguish between migrations and other actions. It just sends the slot bitmap per node. This feature detects a migration by checking if exactly one slot has a new owner.

Only one moved slot per cluster bus message or command is notified. This is to avoid flooding the clients. If more slots are moved at the same time, such as at failovers, clients will need to handle MOVED redirects and update the slot mapping accordingly. What a node can detect is the modification of slot ownerships and addition/removal of replicas.
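The detection rule described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the flat owner arrays, the function name `detect_single_moved_slot`, and the two sentinel values are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_SLOTS 16384            /* number of hash slots in a cluster */
#define MOVED_SLOT_NONE (-1)       /* hypothetical sentinel: no owner changed */
#define MOVED_SLOT_MULTIPLE (-2)   /* hypothetical sentinel: several owners changed */

/* Compare the slot owners before and after processing one cluster bus
 * message or command. Owners are represented by small integer ids here.
 * Returns the slot number if exactly one slot has a new owner, otherwise
 * one of the sentinels above. */
int detect_single_moved_slot(const int *old_owner, const int *new_owner) {
    assert(old_owner != NULL && new_owner != NULL);
    int moved = MOVED_SLOT_NONE;
    for (int slot = 0; slot < NUM_SLOTS; slot++) {
        if (old_owner[slot] == new_owner[slot]) continue;
        /* A second changed slot means this is not a single-slot migration. */
        if (moved != MOVED_SLOT_NONE) return MOVED_SLOT_MULTIPLE;
        moved = slot;
    }
    return moved;
}
```

Under this sketch, only a non-negative return value would lead to a notification; the multiple case is left to the regular MOVED redirect handling.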

A special pubsub channel __cluster__:moved is used (naming inspired by client-side caching). Since the payload of a pubsub message is a string, the event is encoded as a string of the form "MOVED slot endpoint:port", just like a MOVED redirect.

Some special logic is added to avoid creating the pubsub message if there are no subscribers, and to adjust the port (TLS or non-TLS) to the receiver's connection.
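For client authors, parsing the proposed payload could look roughly like this. It is a minimal sketch assuming the "MOVED slot endpoint:port" form described above; the function name `parse_moved_payload` and the buffer sizes are made up for the example, and the format itself is still under discussion in this PR.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse "MOVED <slot> <endpoint>:<port>" into its parts.
 * Splits at the last colon so that endpoints containing colons (such as
 * IPv6 addresses) keep their full host part.
 * Returns 0 on success and -1 on any malformed input. */
int parse_moved_payload(const char *msg, int *slot, char *host,
                        size_t hostlen, int *port) {
    assert(msg && slot && host && port);
    char addr[256];
    if (sscanf(msg, "MOVED %d %255s", slot, addr) != 2) return -1;
    if (*slot < 0 || *slot >= 16384) return -1;

    char *colon = strrchr(addr, ':');
    if (colon == NULL || colon == addr || colon[1] == '\0') return -1;

    char *end;
    long p = strtol(colon + 1, &end, 10);
    if (*end != '\0' || p <= 0 || p > 65535) return -1;

    size_t len = (size_t)(colon - addr);
    if (len >= hostlen) return -1;   /* host would not fit in the buffer */
    memcpy(host, addr, len);
    host[len] = '\0';
    *port = (int)p;
    return 0;
}
```

Because the payload mirrors a MOVED redirect, a client could in principle reuse whatever parser it already has for redirects instead of a dedicated function like this one.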

Fixes #57.

Future improvements:

Possibility to be notified about addition/removal of replicas (without any changed slot ownership)
Possibility to be notified about failovers, i.e. multiple slots changed owner.

If we go ahead with __cluster__:moved only for individual slot migrations, we can do the future notifications in different channels.

The larger topology changes can't be communicated in the notification itself. The client will need to fetch the topology again using e.g. CLUSTER SHARDS.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

madolson

Overall I like the approach and idea. I'm not convinced we should keep the -MOVED syntax though. I wonder if the format should just be __cluster__:moved SLOT <nodeid>. I don't have strong opinions about this though.

madolson · 2024-04-11T20:31:34Z

src/cluster_legacy.c

@@ -4844,17 +4851,43 @@ void clusterCron(void) {
        clusterUpdateState();
 }

+/* Notify clients subscribed to slot moved events. */
+void clusterNotifyMovedSlot(int moved_slot, list *clients) {


It feels like this should be cluster.h, it seems like all cluster implementations would want to send this type of notification.

The function is using clusterNode which is only in cluster_legacy.[ch].

This separation of cluster and cluster_legacy is quite arbitrary. I don't mind that you fix it or we can just merge the two again. Then I'll rebase this PR. :)

Do you have a better idea?

src/cluster_legacy.c

madolson · 2024-04-11T20:40:38Z

src/cluster.c

+/* For redirects, verb must start with a dash, e.g. "-ASK" or "-MOVED". */
+sds clusterFormatRedirect(const char *verb, int slot, clusterNode *n, int use_tls_port) {
+    const char *endpoint = clusterNodePreferredEndpoint(n);
+    int port = clusterNodeClientPort(n, use_tls_port);
+    return sdscatprintf(sdsempty(), "%s %d %s:%d", verb, slot, endpoint, port);
+}
+


Are there any clients that primarily store a map of NodeID -> Nodes as opposed to endpoint:port -> Nodes? I ask because I'm wondering if it would be useful to also return the NodeID here. I know the Python client doesn't, and I'm less familiar with the other clients, but if there are clients that don't key their main node map off the endpoints, then maybe it would be easier for them.

I don't know, but since redirects use the host:port form, clients need to be able to identify them by this.

zuiderkwast · 2024-04-11T23:49:28Z

> I wonder if the format should just be __cluster__:moved SLOT <nodeid>.

What do you mean by that? One string? Which one is the channel and which one is the message?

The channel is a string and the message is another string. Do you mean the message is "slot nodeid"? This would work (though I prefer host:port because clients need to use that for redirects anyway). It depends on the channel name: if we call the channel something else, like just __cluster__ or __cluster__:topology, then I'd like MOVED to be part of the message.

If you want to return something else, like an array or map, we can't use pubsub. We could use RESP3 push though and require RESP3 for this feature.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

zuiderkwast · 2024-04-14T11:19:29Z

When thinking about client-side caching, I realized the invalidate message looks different depending on RESP version. For RESP2 it's a pubsub message of the form ["message", "__redis__:invalidate", "foo"] but in RESP3 it's just a push message of the 2-element form ["invalidate", ["foo"]].

Pubsub is messy for various reasons (you don't get a proper reply to SUBSCRIBE, these special channels do not propagate in a cluster in the same way as normal channels, etc.) so maybe we should just require RESP3 for this and use a new command like CLUSTER SUBSCRIBE-MOVED or something like that? The command gets a proper reply "+OK" and the push messages can be structured properly, like ["moved", 1234, "example.com", 6379]. WDYT?
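To make the RESP3 idea concrete, here is a rough sketch of how a structured push message like ["moved", 1234, "example.com", 6379] would look on the wire. The message layout is only the proposal from this comment, and `encode_moved_push` is a hypothetical helper, not an existing server function. It relies on standard RESP3 framing: `>` for a push, `$` for a blob string, and `:` for a number.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Encode the hypothetical push message ["moved", slot, host, port] as a
 * RESP3 push frame with four elements.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or -1 if the buffer is too small. */
int encode_moved_push(char *buf, size_t buflen, int slot,
                      const char *host, int port) {
    assert(buf != NULL && host != NULL);
    int n = snprintf(buf, buflen,
                     ">4\r\n"           /* push type, 4 elements */
                     "$5\r\nmoved\r\n"  /* message kind as a blob string */
                     ":%d\r\n"          /* slot as a number */
                     "$%zu\r\n%s\r\n"   /* host as a blob string */
                     ":%d\r\n",         /* port as a number */
                     slot, strlen(host), host, port);
    if (n < 0 || (size_t)n >= buflen) return -1;
    return n;
}
```

Compared with the pubsub variant, the client gets typed fields and does not need to split a string, at the cost of requiring RESP3.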

madolson · 2024-04-14T20:11:10Z

> What do you mean by that? One string? Which one is the channel and which one is the message?

Sorry. The channel name you proposed was fine __cluster__:moved but I was questioning if we needed to have the remaining string be consistent with the moved response. -MOVED is redundant, we already know it is a moved request, so I think we could drop it. One thing I don't like about the current -MOVED messages is that they are dependent on the client TLS state. It also doesn't cover the other major topology change, which is failovers. I think we could just come up with a new syntax that is easy for clients to read. __cluster__:moved + <slot id> <node id>, should be enough for clients to update their internal structures. If they don't know the node, they can re-discover the topology.

New command is interesting. I really think we should take ownership of a client to be able to play around with the complexity of implementing this suggestion in the client.

zuiderkwast · 2024-04-14T22:21:13Z

I can implement it in hiredis-cluster and ered. It should be fairly easy.

I can probably implement failover notification too, though it's a bit harder to detect and a bit less important imho for clients.

enjoy-binbin

LGTM.

So we recommend that the client subscribes to this channel. When MOVED occurs, the client can directly update the mapping information of that single slot (instead of the whole slots mapping), right?

I'm wondering if we should send a special message for CLUSTER_MOVED_SLOT_MULTIPLE so that the client can actively update the mapping (before the MOVED error)?

Another way is, can we send channel messages directly in clusterAddSlot? I think the overhead may not be that big, so that for each slot MOVED, we can just send the message so we don't need to bother to handle CLUSTER_MOVED_SLOT_MULTIPLE.

zuiderkwast · 2024-04-19T19:24:03Z

> So we recommend that the client subscribes to this channel. When MOVED occurs, the client can directly update the mapping information of that single slot (instead of the whole slots mapping), right?

@enjoy-binbin That's right.
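A minimal sketch of what such a single-slot update could look like in a client, assuming the client keys its nodes by "host:port" strings. The `slot_map` type, its size limits, and the function names are invented for this example and are not taken from any existing client.

```c
#include <assert.h>
#include <string.h>

#define NUM_SLOTS 16384
#define MAX_NODES 64    /* hypothetical cap, chosen for this sketch */
#define ADDR_LEN 128

typedef struct {
    char nodes[MAX_NODES][ADDR_LEN]; /* known "host:port" strings */
    int num_nodes;
    short owner[NUM_SLOTS];          /* index into nodes, or -1 if unknown */
} slot_map;

void slot_map_init(slot_map *m) {
    m->num_nodes = 0;
    for (int i = 0; i < NUM_SLOTS; i++) m->owner[i] = -1;
}

/* Look up a node by address, adding it if it is new.
 * Returns its index, or -1 if the table is full or the address too long. */
static int slot_map_node_index(slot_map *m, const char *addr) {
    for (int i = 0; i < m->num_nodes; i++)
        if (strcmp(m->nodes[i], addr) == 0) return i;
    if (m->num_nodes == MAX_NODES || strlen(addr) >= ADDR_LEN) return -1;
    strcpy(m->nodes[m->num_nodes], addr);
    return m->num_nodes++;
}

/* Apply one slot-moved notification to the local mapping.
 * Returns 0 if the single entry was updated, or -1 if the caller should
 * fall back to a full topology refresh. */
int slot_map_apply_moved(slot_map *m, int slot, const char *addr) {
    assert(m != NULL && addr != NULL);
    if (slot < 0 || slot >= NUM_SLOTS) return -1;
    int idx = slot_map_node_index(m, addr);
    if (idx < 0) return -1;
    m->owner[slot] = (short)idx;
    return 0;
}
```

The point of the sketch is that a notification touches exactly one entry in `owner`, whereas a refresh triggered by a redirect rewrites the whole table.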

> I'm wondering if we should send a special message for CLUSTER_MOVED_SLOT_MULTIPLE so that the client can actively update the mapping (before the MOVED error)?

Yes, it's probably a good idea.

> Another way is, can we send channel messages directly in clusterAddSlot? I think the overhead may not be that big, so that for each slot MOVED, we can just send the message so we don't need to bother to handle CLUSTER_MOVED_SLOT_MULTIPLE.

In a cluster with 3 shards, if there is a failover, 5000 slots move immediately. We have 5000 messages to every client. I think it's too much.

If we notify like "MOVED MULTIPLE", then the client can reload the mapping, but if all clients are notified at the same time, all clients will reload it at the same time. Can this be a problem?

If we detect that all slots from one node have moved to another node which was previously a replica (i.e. we detect that there was a failover) then we can send a special message for this like "MOVED ALL FROM ip1:port1 TO ip2:port2" (or "FAILOVER ip1:port1 TO ip2:port2"). This will let the client update the mapping without reloading it from the server. I think it can be good for clients. Maybe it's hard to implement it? I don't know...
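On the client side, handling such a failover message might look something like the sketch below. It assumes the "FAILOVER ip1:port1 TO ip2:port2" wording floated in this comment, which is only a suggestion, and it assumes the client tracks slot owners as integer node indexes; `parse_failover` and `remap_slots` are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NUM_SLOTS 16384
#define ADDR_LEN 128   /* the sscanf widths below must stay ADDR_LEN - 1 */

/* Parse the hypothetical message "FAILOVER <from> TO <to>".
 * Both output buffers must hold at least ADDR_LEN bytes.
 * Returns 0 on success and -1 if the message has another shape. */
int parse_failover(const char *msg, char *from, char *to) {
    assert(msg && from && to);
    if (sscanf(msg, "FAILOVER %127s TO %127s", from, to) != 2) return -1;
    return 0;
}

/* Repoint every slot owned by node index `from` to node index `to`.
 * Returns the number of slots that were remapped. */
int remap_slots(int *owner, int from, int to) {
    assert(owner != NULL);
    int remapped = 0;
    for (int slot = 0; slot < NUM_SLOTS; slot++) {
        if (owner[slot] == from) {
            owner[slot] = to;
            remapped++;
        }
    }
    return remapped;
}
```

With this shape, one message replaces what would otherwise be thousands of per-slot notifications, and the client does not need to reload the mapping from the server.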

barshaul · 2024-04-21T05:57:49Z

> Sorry. The channel name you proposed was fine __cluster__:moved but I was questioning if we needed to have the remaining string be consistent with the moved response. -MOVED is redundant, we already know it is a moved request, so I think we could drop it. One thing I don't like about the current -MOVED messages is that they are dependent on the client TLS state. It also doesn't cover the other major topology change, which is failovers. I think we could just come up with a new syntax that is easy for clients to read. __cluster__:moved + <slot id> <node id>, should be enough for clients to update their internal structures. If they don't know the node, they can re-discover the topology.

I haven't encountered a client that stores the node-id directly. If they do store it, it's usually as an additional parameter rather than a hashable key. Most clients store nodes as addr:port. If you only provide the node-id, it would necessitate further modifications to most clients to establish a mapping from node-id to the node itself. I believe returning __cluster__:moved followed by addr:port, similar to what clients already receive in the MOVED error, would be the optimal approach.

enjoy-binbin · 2024-04-24T08:59:45Z

> In a cluster with 3 shards, if there is a failover, 5000 slots move immediately. We have 5000 messages to every client. I think it's too much.
>
> If we notify like "MOVED MULTIPLE", then the client can reload the mapping, but if all clients are notified at the same time, all clients will reload it at the same time. Can this be a problem?

yes, considering we may have thousands of clients, i guess it is too much and will be a problem. But I feel like CLUSTER_MOVED_SLOT_MULTIPLE seems to be more common? After all we should rarely move just one slot in a change.

> If we detect that all slots from one node have moved to another node which was previously a replica (i.e. we detect that there was a failover) then we can send a special message for this like "MOVED ALL FROM ip1:port1 TO ip2:port2" (or "FAILOVER ip1:port1 TO ip2:port2"). This will let the client update the mapping without reloading it from the server. I think it can be good for clients. Maybe it's hard to implement it? I don't know...

this seems like a good idea in failover case, i think we can take a try, it doesn’t seem difficult to implement

zuiderkwast · 2024-04-24T10:18:51Z

> yes, considering we may have thousands of clients, i guess it is too much and will be a problem. But I feel like CLUSTER_MOVED_SLOT_MULTIPLE seems to be more common? After all we should rarely move just one slot in a change.

No, in slot migration, you only move one slot at a time. When one slot is finished, the next slot is moved. With this feature, the clients will get notification for one slot after each slot is migrated. I described this in the PR above.

> this seems like a good idea in failover case, i think we can take a try, it doesn’t seem difficult to implement

Maybe you can help me. :) My plan is to come back to this PR soon, when rebranding docs is finished.

enjoy-binbin · 2024-04-24T11:20:36Z

> No, in slot migration, you only move one slot at a time. When one slot is finished, the next slot is moved. With this feature, the clients will get notification for one slot after each slot is migrated. I described this in the PR above.

ohh, sorry that is right, i got it mixed up. (In Tencent Cloud we are moving a batch of slots in one change.)

> Maybe you can help me. :) My plan is to come back to this PR soon, when rebranding docs is finished.

:) I've been busy lately, it may take me some time to get back involved

madolson added the major-decision-pending Needs decision by core team label Apr 11, 2024

madolson reviewed Apr 11, 2024

Use other magic numbers for the CLUSTER_MOVED_SLOT constants

1cd6dda

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

zuiderkwast force-pushed the subscribe-cluster-topology branch from 92b5db0 to 1cd6dda Compare April 12, 2024 12:52

zuiderkwast added the cluster label Apr 12, 2024

zuiderkwast requested review from PingXie and enjoy-binbin April 15, 2024 15:09

PingXie mentioned this pull request Apr 15, 2024

[NEW] Atomic slot migration HLD #23

Open

enjoy-binbin reviewed Apr 19, 2024
