[NEW] Cluster V2 Discussion #384

Open
PingXie opened this issue Apr 25, 2024 · 9 comments

Comments

@PingXie
Member

PingXie commented Apr 25, 2024

Here are the problems that I think we need to solve in the current cluster:

  1. Strong consistency (for cluster topology)

Cluster topology is concerned with which nodes own which slots and which node in a shard holds primaryship. The current cluster implementation is not even eventually consistent by design, because there are places where node epochs are bumped without consensus (a deliberate trade-off). This pushes extra complexity onto the client side.

  2. Better manageability (of global config/data)

This particular issue provides the exact context for this pain point.

  3. More resilience (to stressful client workloads)

Today, both the cluster bus and the client workload run on the same main thread, so a demanding client workload has the potential to starve the cluster bus and lead to unnecessary failovers.

  4. Higher scale

The V1 cluster is a full mesh, so the cluster gossip traffic is proportional to N^2, where N is the number of (data) nodes in the cluster. The practical limit of a V1 cluster is around 500 nodes.
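
A rough back-of-envelope model of that quadratic growth, with made-up ping rates and message sizes (the real cluster bus constants differ):

```python
# Rough model of full-mesh gossip traffic. All constants here are
# illustrative assumptions, not the actual cluster bus parameters.

def gossip_bytes_per_sec(n_nodes,
                         pings_per_node_per_sec=1.0,
                         bytes_per_gossip_entry=100,
                         gossiped_fraction=0.1):
    # Each node pings peers, and every ping/pong carries gossip sections
    # about a fraction of the cluster, so per-message size grows with N
    # and total traffic grows roughly with N^2.
    entries_per_msg = max(1, int(n_nodes * gossiped_fraction))
    msg_size = entries_per_msg * bytes_per_gossip_entry
    msgs_per_sec = n_nodes * pings_per_node_per_sec * 2  # ping + pong
    return msgs_per_sec * msg_size

for n in (10, 100, 500, 1000):
    print(f"{n:5d} nodes -> ~{gossip_bytes_per_sec(n) / 1e6:.1f} MB/s on the bus")
```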

Originally posted by @PingXie in #58 (comment)

@artikell
Contributor

I understand the strong-consistency and manageability issues to be caused mostly by the large-scale decentralized architecture. One idea is to use Redis Sentinel to manage the cluster, keeping metadata control and high-availability operations within a small set of nodes.

Regarding resilience, I noticed a previous issue: redis/redis#10878. Perhaps @madolson has some related work there.

Regarding higher scale, an excessively large cluster leads to more connections and lower performance. We could instead consider letting individual nodes scale up to larger sizes to increase the capacity of the whole cluster, so forkless is also a point that needs attention.

@hpatro
Contributor

hpatro commented Apr 29, 2024

There were a few more points that I feel should be addressed as part of the new design, which I brought up in #58 (comment). Reposting my thoughts here (maybe we can merge them into the top-level comment):

  • Support higher scaling (larger cluster size) - Currently a Valkey cluster can't scale beyond ~500 nodes, beyond which the gossip protocol doesn't scale well. The new architecture should strive to handle much larger cluster sizes.
  • Support smaller cluster sizes - With the current design, only primary nodes can vote, so a cluster requires at least 3 primaries to form a quorum. The new design should consider letting every node (primary or replica) be part of the quorum (see the sketch after this list).
  • Centralized metadata store - Currently config, functions, and ACLs need to be applied on each node. This is painful and cumbersome during cluster setup as well as during scale-out. A mechanism to send this information once and have it applied to all nodes would be ideal.
  • Decouple control and data transfer via cluster bus - Currently, the clusterbus carries both control data (health info, gossip info, etc.) and pubsub data. With high pubsub volume, the clusterbus can get overwhelmed, causing issues with the cluster health status and delaying topology-information consistency across the system.
  • Lower failover time - Due to the primary-only consensus mechanism in the current approach, failover in a shard can be delayed if there is a network partition and other primaries aren't reachable.
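
For the "support smaller cluster sizes" point, a quick illustration of the quorum arithmetic, assuming simple majority voting:

```python
# Quorum arithmetic for a single shard of 1 primary + 2 replicas.
# Assumes a simple majority quorum; illustrative only.

def majority(voters: int) -> int:
    return voters // 2 + 1

# Today only primaries vote: a lone shard has a single voter, so it cannot
# arbitrate its own failover -- hence the current need for >= 3 primaries.
print("primary-only voting:", majority(1), "of", 1)

# If replicas could vote, the same 3-node shard has 3 voters and a quorum
# of 2, enough to keep electing a primary after losing any one node.
print("all-node voting:    ", majority(3), "of", 3)
```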

@hwware
Member

hwware commented May 1, 2024

I am not sure: in cluster v2, is there a plan to remove the sentinel node?

@madolson
Member

madolson commented May 1, 2024

I am not sure: in cluster v2, is there a plan to remove the sentinel node?

My directional, long-term ask would be to merge them together: Sentinel remains API-compatible but becomes a special case rather than a separate deployment mode.

@PingXie
Member Author

PingXie commented May 1, 2024

The way I see it, a bit philosophically, the operational convenience from coupling cluster management with data access comes at the cost of complexity, reliability, and scalability. In this sense, a big part of the cluster V2 aspiration is to go back to the Sentinel architecture and decouple the two, which creates the chance of merging the two (Sentinel and cluster V2), conceptually speaking. However, I would also be interested in retaining the existing operational experience.

@madolson
Member

madolson commented May 1, 2024

The one thing I don't want to retain from Sentinel is distinct "sentinel" nodes. The path other projects like Kafka have taken is to have internal control nodes, but organized within the same cluster, so the arrangement is more transparent to users. If you think about it from a Kubernetes deployment perspective, we want to deploy one cluster that can handle failovers internally. I think that was one of the things I disliked about the Redis Ltd. approach, which forced users to understand the TD and FC concepts.

@madolson
Member

madolson commented May 1, 2024

Here was my original list (removing the features that have been implemented)

Improved use case support

This pillar focuses on providing improved functionality outside of the core cluster code that helps improve the usability of cluster mode.

  • Clusterbus as HA for single shard:
    Allows the clusterbus to replace sentinel as the HA mechanism for Redis. This will require voting replicas, which is discussed later.
    Add specification for a new cluster implementation. redis/redis#10875

  • Custom hashing support:
    Some applications want their own mechanism for determining slots, so we should extend the hashtag semantics to include information about which slot the request is intended for (a sketch of today's hashtag-to-slot mapping follows this list).

  • Hashtag scanning/atomic deletion:
    A common ask has been the ability to use SCAN-like commands to find keys within a hashtag without having to scan the entire keyspace. One proposal is to be able to create a group of keys that can be deleted atomically. A secondary index could also solve this issue.
    (I'm sure there is an issue for this, I'll find it)
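
For context on the two hashtag items above, here is a minimal sketch (in Python, not Valkey's C code) of how today's hashtag-to-slot mapping works: CRC16 of the key, or of the non-empty {...} tag if one is present, modulo 16384. The "custom hashing" idea would extend these semantics so a request can name its target slot directly.

```python
# Minimal sketch of the existing key -> slot mapping (CRC16-XMODEM mod 16384).
# This mirrors the documented hashtag rule; it is not Valkey's implementation.

def crc16_xmodem(data: bytes) -> int:
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_hash_slot(key: bytes) -> int:
    # If the key contains a non-empty "{tag}", only the tag is hashed, so
    # related keys can be pinned to the same slot (and the same node).
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end > start + 1:
            key = key[start + 1:end]
    return crc16_xmodem(key) % 16384

# Both keys land in one slot, which is what hashtag scanning / atomic
# deletion of a key group would build on.
print(key_hash_slot(b"{user:1000}.following") == key_hash_slot(b"{user:1000}.followers"))
```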

Cluster management improvements

This pillar focuses on improving the ease of managing Redis clusters.

  • Consensus based + Atomic slot migration:
    Implement a server-based slot migration command that migrates a slot's data from one node to another. (We have a solution that we hopefully will post someday.)
    "CLUSTER MIGRATE slot node" command redis/redis#2807

  • Improved metrics for slot performance:
    Add metrics for individual slot performance to make decisions about hot shards/keys. This makes it easier to identify slots that should be moved. Easy metrics to capture are key accesses; ideally memory usage would be better, but that's hard. (A small illustration follows this list.)

  • Dynamic slot ownership
    For all-primary clusters in caching-based use cases, data durability is not needed, and nodes in the cluster can simply take over slots from a node that dies. Adding nodes can also mean they automatically take over slot ownership from other nodes.
    [Cluster] "Cache-Only" mode redis/redis#4160

  • Auto scaling
    Support automatic rebalancing of clusters when adding nodes/removing nodes as well as during steady state when there is traffic load mismatch.
    Redis Cluster Auto Scaling - Requirements redis/redis#3009

  • Moving cluster bus to a separate thread, improved reliability in case of busy server
    Today if the main thread is busy it may not respond to a health check ping even though it is still up and healthy. Refactoring the clusterbus onto its own thread will make it more responsive.

  • Refactor abstractions in cluster.c:
    Several abstractions in cluster.c are hard to follow and should be broken up, including cluster bus and node handling, slot awareness, and health monitoring.

  • Human readable names for nodes:
    Today individual Redis nodes report hexadecimal IDs, which are not human readable. We should additionally assign each node a more readable name that is either logical or corresponds to its primary.
    Cluster human readable nodename feature  redis/redis#9564

  • Module support for different consensus algorithms:
    Today Redis only supports the clusterbus as a consensus algorithm, but we could also support module hooks for other forms of consensus.
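
For the "improved metrics for slot performance" item above, a tiny illustration of the kind of bookkeeping implied; the class and method names are hypothetical, not an existing Valkey interface, and slot_of() stands in for the real CRC16 mapping sketched earlier.

```python
# Hypothetical per-slot access counters for spotting hot slots to migrate.
from collections import Counter

def slot_of(key: bytes) -> int:
    # Placeholder for the real CRC16-based key -> slot mapping.
    return sum(key) % 16384

class SlotAccessStats:
    def __init__(self):
        self.accesses = Counter()

    def record(self, key: bytes) -> None:
        # Key-access counts are the "easy" metric mentioned above; tracking
        # per-slot memory would be more useful but much harder to maintain.
        self.accesses[slot_of(key)] += 1

    def hottest(self, n: int = 10):
        # Slots ranked by access count -- candidates for rebalancing.
        return self.accesses.most_common(n)

stats = SlotAccessStats()
for k in (b"{user:1}.cart", b"{user:1}.cart", b"session:42"):
    stats.record(k)
print(stats.hottest())
```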

Cluster HA improvements

This pillar focuses on improving the high-availability aspects of Redis cluster, specifically failover and health checks.

  • Reduce messages sent for node health decisions:
    The Redis clusterbus has an NxN full mesh of gossip health messages. This can cause performance degradation and instability in large clusters, as health detection and voting authorization become slow. There are several ways to solve this, such as making failovers shard-local or being smarter about propagating information.
    Redis Cluster: reduce gossip messages total traffic redis/redis#3929

  • Voting replicas: (group this with the other consensus items)
    Today replicas don't take part in leader election; allowing them to vote would be useful for smaller cluster sizes, especially single shards.
    Add specification for a new cluster implementation. redis/redis#10875

  • Avoiding cascading failovers leading to data loss:
    It's possible for a replica without data to be promoted to primary and lose all data in the shard. This is typically the result of a cascading failover. Ideally we should add a stopgap here to prevent the last node with data from being demoted.

  • Placement awareness:
    Today individual nodes have no concept of how they are placed relative to each other, and will happily allow all the primaries to end up in the same zone. This may also include a notion of multi-region awareness.

  • RESP3 topology updates
    Today clients only learn about topology changes when they send a request to the wrong node, and even then they need to call CLUSTER SLOTS to re-learn the entire topology, which is inefficient. Instead, nodes could proactively notify clients when a topology change has occurred: a client opts into topology updates and from then on receives information about just what has changed (a client-side sketch follows).
    CLUSTER SUBSCRIBE SLOTS (topology changes) redis/redis#10150
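
To make the last point concrete, a client-side sketch of the difference between a full CLUSTER SLOTS refresh and an incremental topology push; the push message shape (a list of moved slot ranges plus the new endpoint) is hypothetical.

```python
# Client-side slot map. The incremental "push" format below is hypothetical;
# the point is that it touches only the moved ranges instead of rebuilding
# the whole 16384-entry map from a CLUSTER SLOTS reply.

NUM_SLOTS = 16384

class SlotMap:
    def __init__(self):
        self.owner = [None] * NUM_SLOTS  # slot index -> "host:port"

    def full_refresh(self, cluster_slots_reply):
        # What clients do today after a MOVED redirect: re-learn everything.
        for start, end, endpoint in cluster_slots_reply:
            for slot in range(start, end + 1):
                self.owner[slot] = endpoint

    def apply_push(self, moved_ranges):
        # Hypothetical RESP3 push: only the ranges that actually moved.
        for start, end, endpoint in moved_ranges:
            for slot in range(start, end + 1):
                self.owner[slot] = endpoint

slots = SlotMap()
slots.full_refresh([(0, 8191, "10.0.0.1:6379"), (8192, 16383, "10.0.0.2:6379")])
slots.apply_push([(5000, 5100, "10.0.0.3:6379")])  # only these slots changed
print(slots.owner[5050])  # -> 10.0.0.3:6379
```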

@madolson
Member

madolson commented May 1, 2024

@PingXie What do you want to get consensus on here? At its core, I think the next step is we need to have someone come up with a concrete design. Independently, I also want us to finish the module interface for cluster so that we can work on that as a module that can be tested and folks can opt-in to it during the lifecycle of Valkey 8 and we can GA it in Valkey 9.

@PingXie
Member Author

PingXie commented May 1, 2024

@PingXie What do you want to get consensus on here?

The value proposition, aka the "why" question. I consider this thread to be more of an open-ended discussion for the broader community.

Independently, I also want us to finish the module interface for cluster so that we can work on that as a module that can be tested and folks can opt-in to it during the lifecycle of Valkey 8 and we can GA it in Valkey 9.

Agreed. I think it is wise to avoid coupling whenever possible. "Modularization of the cluster management logic" is a good thing on its own, and practically speaking it is already a half-done job. I don't like where we are, and I think we should just go ahead and finish it properly.

At its core, I think the next step is we need to have someone come up with a concrete design.

I am onboard with that and I can see it being a parallel thread to the "why" discussion (this thread).

How about we break this topic into three issues/discussions?

  1. the "value" discussion can stay on this thread
  2. we can create a new/more concrete task to track the modularization work
  3. we can also deep dive into a strawman design proposal for cluster v2.
