[NEW] Cluster V2 Discussion #384

Open
PingXie opened this issue Apr 25, 2024 · 9 comments

Comments

@PingXie
Member

PingXie commented Apr 25, 2024

Here are the problems that I think we need to solve in the current cluster:

  1. Strong consistency (for cluster topology)

Cluster topology is concerned with which nodes own which slots and which node in a shard holds primaryship. The current cluster implementation is not even eventually consistent by design, because there are places where node epochs are bumped without consensus (a deliberate trade-off). This pushes extra complexity onto the client side.

  2. Better manageability (of global config/data)

This particular issue provides the exact context for this pain point.

  3. More resilience (to stressful client workloads)

Today, both the cluster bus and the client workload run on the same main thread, so a demanding client workload has the potential to starve the cluster bus and lead to unnecessary failovers.

  4. Higher scale

The V1 cluster is a full mesh, so the cluster gossip traffic is proportional to N^2, where N is the number of (data) nodes in the cluster. The practical limit of a V1 cluster is around 500 nodes.
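
A rough back-of-envelope model of that quadratic growth, with made-up ping rates and message sizes (the real cluster bus constants differ):

```python
# Rough model of full-mesh gossip traffic. All constants here are
# illustrative assumptions, not the actual cluster bus parameters.

def gossip_bytes_per_sec(n_nodes,
                         pings_per_node_per_sec=1.0,
                         bytes_per_gossip_entry=100,
                         gossiped_fraction=0.1):
    # Each node pings peers, and every ping/pong carries gossip sections
    # about a fraction of the cluster, so per-message size grows with N
    # and total traffic grows roughly with N^2.
    entries_per_msg = max(1, int(n_nodes * gossiped_fraction))
    msg_size = entries_per_msg * bytes_per_gossip_entry
    msgs_per_sec = n_nodes * pings_per_node_per_sec * 2  # ping + pong
    return msgs_per_sec * msg_size

for n in (10, 100, 500, 1000):
    print(f"{n:5d} nodes -> ~{gossip_bytes_per_sec(n) / 1e6:.1f} MB/s on the bus")
```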

Originally posted by @PingXie in #58 (comment)

@artikell
Contributor

I understand the strong-consistency and manageability issues to be caused mostly by the large-scale decentralized architecture. One idea is to use Redis Sentinel to manage the cluster, keeping metadata control and high-availability operations within a small set of nodes.

Regarding resilience, I noticed a previous issue: redis/redis#10878. Perhaps @madolson has some related work there.

Regarding higher scale, an excessively large cluster leads to more connections and lower performance. We could instead consider letting individual nodes scale up to larger sizes to increase the capacity of the whole cluster, so forkless is also a point that needs attention.

@hpatro
Contributor

hpatro commented Apr 29, 2024

There were a few more points that I feel should be addressed as part of the new design, which I brought up in #58 (comment). Reposting my thoughts here (maybe we can merge them into the top-level comment):

  • Support higher scaling (larger cluster size) - Currently a Valkey cluster can't scale beyond ~500 nodes, beyond which the gossip protocol doesn't scale well. The new architecture should strive to handle much larger cluster sizes.
  • Support smaller cluster sizes - With the current design, only primary nodes can vote, so a cluster requires at least 3 primaries to form a quorum. The new design should consider letting every node (primary or replica) be part of the quorum (see the sketch after this list).
  • Centralized metadata store - Currently config, functions, and ACLs need to be applied on each node. This is painful and cumbersome during cluster setup as well as during scale-out. A mechanism to send this information once and have it applied to all nodes would be ideal.
  • Decouple control and data transfer via cluster bus - Currently, the clusterbus carries both control data (health info, gossip info, etc.) and pubsub data. With high pubsub volume, the clusterbus can get overwhelmed, causing issues with the cluster health status and delaying topology-information consistency across the system.
  • Lower failover time - Due to the primary-only consensus mechanism in the current approach, failover in a shard can be delayed if there is a network partition and other primaries aren't reachable.
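
For the "support smaller cluster sizes" point, a quick illustration of the quorum arithmetic, assuming simple majority voting:

```python
# Quorum arithmetic for a single shard of 1 primary + 2 replicas.
# Assumes a simple majority quorum; illustrative only.

def majority(voters: int) -> int:
    return voters // 2 + 1

# Today only primaries vote: a lone shard has a single voter, so it cannot
# arbitrate its own failover -- hence the current need for >= 3 primaries.
print("primary-only voting:", majority(1), "of", 1)

# If replicas could vote, the same 3-node shard has 3 voters and a quorum
# of 2, enough to keep electing a primary after losing any one node.
print("all-node voting:    ", majority(3), "of", 3)
```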

@hwware
Member

hwware commented May 1, 2024

I am not sure: in cluster v2, is there a plan to remove the sentinel node?

@madolson
Member

madolson commented May 1, 2024

I am not sure: in cluster v2, is there a plan to remove the sentinel node?

My directional, long-term ask would be to merge them together: Sentinel remains API-compatible but becomes a special case rather than a separate deployment mode.

@PingXie
Member Author

PingXie commented May 1, 2024

The way I see it, a bit philosophically, the operational convenience from coupling cluster management with data access comes at the cost of complexity, reliability, and scalability. In this sense, a big part of the cluster V2 aspiration is to go back to the Sentinel architecture and decouple the two, which creates the chance of merging the two (Sentinel and cluster V2), conceptually speaking. However, I would also be interested in retaining the existing operational experience.

@madolson
Member

madolson commented May 1, 2024

The one thing I don't want to retain from Sentinel is distinct "sentinel" nodes. The path other projects like Kafka have taken is to have internal control nodes, but organized within the same cluster, so the arrangement is more transparent to users. If you think about it from a Kubernetes deployment perspective, we want to deploy one cluster that can handle failovers internally. I think that was one of the things I disliked about the Redis Ltd. approach, which forced users to understand the TD and FC concepts.

@madolson
Member

madolson commented May 1, 2024

Here was my original list (removing the features that have been implemented)

Improved use case support

This pillar focuses on providing improved functionality outside of the core cluster code that helps improve the usability of cluster mode.

  • Clusterbus as HA for single shard:
    Allows the clusterbus to replace sentinel as the HA mechanism for Redis. This will require voting replicas, which is discussed later.
    Add specification for a new cluster implementation. redis/redis#10875

  • Custom hashing support:
    Some applications want their own mechanism for determining slots, so we should extend the hashtag semantics to include information about which slot the request is intended for (a sketch of today's hashtag-to-slot mapping follows this list).

  • Hashtag scanning/atomic deletion:
    A common ask has been the ability to use SCAN-like commands to find keys within a hashtag without having to scan the entire keyspace. One proposal is to be able to create a group of keys that can be deleted atomically. A secondary index could also solve this issue.
    (I'm sure there is an issue for this, I'll find it)
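
For context on the two hashtag items above, here is a minimal sketch (in Python, not Valkey's C code) of how today's hashtag-to-slot mapping works: CRC16 of the key, or of the non-empty {...} tag if one is present, modulo 16384. The "custom hashing" idea would extend these semantics so a request can name its target slot directly.

```python
# Minimal sketch of the existing key -> slot mapping (CRC16-XMODEM mod 16384).
# This mirrors the documented hashtag rule; it is not Valkey's implementation.

def crc16_xmodem(data: bytes) -> int:
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_hash_slot(key: bytes) -> int:
    # If the key contains a non-empty "{tag}", only the tag is hashed, so
    # related keys can be pinned to the same slot (and the same node).
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end > start + 1:
            key = key[start + 1:end]
    return crc16_xmodem(key) % 16384

# Both keys land in one slot, which is what hashtag scanning / atomic
# deletion of a key group would build on.
print(key_hash_slot(b"{user:1000}.following") == key_hash_slot(b"{user:1000}.followers"))
```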

Cluster management improvements

This pillar focuses on improving the ease of managing Redis clusters.

  • Consensus based + Atomic slot migration:
    Implement a server-based slot migration command that migrates a slot's data from one node to another. (We have a solution that we hopefully will post someday.)
    "CLUSTER MIGRATE slot node" command redis/redis#2807

  • Improved metrics for slot performance:
    Add metrics for individual slot performance to make decisions about hot shards/keys. This makes it easier to identify slots that should be moved. Easy metrics to capture are key accesses; ideally memory usage would be better, but that's hard. (A small illustration follows this list.)

  • Dynamic slot ownership
    For all-primary clusters in caching-based use cases, data durability is not needed, and nodes in the cluster can simply take over slots from a node that dies. Adding nodes can also mean they automatically take over slot ownership from other nodes.
    [Cluster] "Cache-Only" mode redis/redis#4160

  • Auto scaling
    Support automatic rebalancing of clusters when adding nodes/removing nodes as well as during steady state when there is traffic load mismatch.
    Redis Cluster Auto Scaling - Requirements redis/redis#3009

  • Moving cluster bus to a separate thread, improved reliability in case of busy server
    Today if the main thread is busy it may not respond to a health check ping even though it is still up and healthy. Refactoring the clusterbus onto its own thread will make it more responsive.

  • Refactor abstractions in cluster.c:
    Several abstractions in cluster.c are hard to follow and should be broken up, including cluster bus and node handling, slot awareness, and health monitoring.

  • Human readable names for nodes:
    Today individual Redis nodes report hexadecimal IDs, which are not human readable. We should additionally assign each node a more readable name that is either logical or corresponds to its primary.
    Cluster human readable nodename feature  redis/redis#9564

  • Module support for different consensus algorithms:
    Today Redis only supports the clusterbus as a consensus algorithm, but we could also support module hooks for other forms of consensus.
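
For the "improved metrics for slot performance" item above, a tiny illustration of the kind of bookkeeping implied; the class and method names are hypothetical, not an existing Valkey interface, and slot_of() stands in for the real CRC16 mapping sketched earlier.

```python
# Hypothetical per-slot access counters for spotting hot slots to migrate.
from collections import Counter

def slot_of(key: bytes) -> int:
    # Placeholder for the real CRC16-based key -> slot mapping.
    return sum(key) % 16384

class SlotAccessStats:
    def __init__(self):
        self.accesses = Counter()

    def record(self, key: bytes) -> None:
        # Key-access counts are the "easy" metric mentioned above; tracking
        # per-slot memory would be more useful but much harder to maintain.
        self.accesses[slot_of(key)] += 1

    def hottest(self, n: int = 10):
        # Slots ranked by access count -- candidates for rebalancing.
        return self.accesses.most_common(n)

stats = SlotAccessStats()
for k in (b"{user:1}.cart", b"{user:1}.cart", b"session:42"):
    stats.record(k)
print(stats.hottest())
```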

Cluster HA improvements

This pillar focuses on improving the high-availability aspects of Redis cluster, specifically failover and health checks.

  • Reduce messages sent for node health decisions:
    The Redis clusterbus has an NxN full mesh of gossip health messages. This can cause performance degradation and instability in large clusters, as health detection and voting authorization become slow. There are several ways to solve this, such as making failovers shard-local or being smarter about propagating information.
    Redis Cluster: reduce gossip messages total traffic redis/redis#3929

  • Voting replicas: (group this with the other consensus items)
    Today replicas don't take part in leader election; allowing them to vote would be useful for smaller cluster sizes, especially single shards.
    Add specification for a new cluster implementation. redis/redis#10875

  • Avoiding cascading failovers leading to data loss:
    It's possible for a replica without data to be promoted to primary and lose all data in the shard. This is typically the result of a cascading failover. Ideally we should add a stopgap here to prevent the last node with data from being demoted.

  • Placement awareness:
    Today individual nodes have no concept of how they are placed relative to each other, and will happily allow all the primaries to end up in the same zone. This may also include a notion of multi-region awareness.

  • RESP3 topology updates
    Today clients only learn about topology changes when they send a request to the wrong node, and even then they need to call CLUSTER SLOTS to re-learn the entire topology, which is inefficient. Instead, nodes could proactively notify clients when a topology change has occurred: a client opts into topology updates and from then on receives information about just what has changed (a client-side sketch follows).
    CLUSTER SUBSCRIBE SLOTS (topology changes) redis/redis#10150
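
To make the last point concrete, a client-side sketch of the difference between a full CLUSTER SLOTS refresh and an incremental topology push; the push message shape (a list of moved slot ranges plus the new endpoint) is hypothetical.

```python
# Client-side slot map. The incremental "push" format below is hypothetical;
# the point is that it touches only the moved ranges instead of rebuilding
# the whole 16384-entry map from a CLUSTER SLOTS reply.

NUM_SLOTS = 16384

class SlotMap:
    def __init__(self):
        self.owner = [None] * NUM_SLOTS  # slot index -> "host:port"

    def full_refresh(self, cluster_slots_reply):
        # What clients do today after a MOVED redirect: re-learn everything.
        for start, end, endpoint in cluster_slots_reply:
            for slot in range(start, end + 1):
                self.owner[slot] = endpoint

    def apply_push(self, moved_ranges):
        # Hypothetical RESP3 push: only the ranges that actually moved.
        for start, end, endpoint in moved_ranges:
            for slot in range(start, end + 1):
                self.owner[slot] = endpoint

slots = SlotMap()
slots.full_refresh([(0, 8191, "10.0.0.1:6379"), (8192, 16383, "10.0.0.2:6379")])
slots.apply_push([(5000, 5100, "10.0.0.3:6379")])  # only these slots changed
print(slots.owner[5050])  # -> 10.0.0.3:6379
```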

@madolson
Member

madolson commented May 1, 2024

@PingXie What do you want to get consensus on here? At its core, I think the next step is we need to have someone come up with a concrete design. Independently, I also want us to finish the module interface for cluster so that we can work on that as a module that can be tested and folks can opt-in to it during the lifecycle of Valkey 8 and we can GA it in Valkey 9.

@PingXie
Member Author

PingXie commented May 1, 2024

@PingXie What do you want to get consensus on here?

The value proposition, aka the "why" question. I consider this thread to be more of an open-ended discussion for the broader community.

Independently, I also want us to finish the module interface for cluster so that we can work on that as a module that can be tested and folks can opt-in to it during the lifecycle of Valkey 8 and we can GA it in Valkey 9.

Agreed. I think it is wise to avoid coupling whenever possible. "Modularization of the cluster management logic" is a good thing on its own, and practically speaking it is already a half-done job. I don't like where we are, and I think we should just go ahead and finish it properly.

At its core, I think the next step is we need to have someone come up with a concrete design.

I am onboard with that and I can see it being a parallel thread to the "why" discussion (this thread).

How about we break this topic into three issues/discussions?

  1. the "value" discussion can stay on this thread
  2. we can create a new/more concrete task to track the modularization work
  3. we can also deep dive into a strawman design proposal for cluster v2.
