Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option for balancing a single table at a time to the host regex balancer #4521

Open
keith-turner opened this issue May 3, 2024 · 5 comments
Assignees
Labels
enhancement This issue describes a new feature, improvement, or optimization.

Comments

@keith-turner
Copy link
Contributor

Is your feature request related to a problem? Please describe.

The host regex balancer calls multiple per table balancers. Each per table balancer may make balancing decision based on the current state of its tablets and the current state of all other tablets. When the decision of multiple per table balancers are all executed at once it may cause churn because their assumptions about other tables are partially invalidated. If the host regex balancer could optionally focus on one table at a time it may help avoid or lessen this churn.

Describe the solution you'd like

Add option to the host reg ex balancer that enables balancing a single table at time. This option would be off by default. The behavior of this option would be as follows for this code section.

  • If there are migrations in progress let them complete, so return.
  • Sort the table ids. This will ensure that as the balancer is repeatedly called that it works through tables in a consistent order.
  • Call the balancer for each table until a per table balancer returns migrations.
  • Return after the first table returns some migrations, do not consult any other per table balancers.

Need to experiment with this solution to see if it helps with churn.

@keith-turner keith-turner added the enhancement This issue describes a new feature, improvement, or optimization. label May 3, 2024
@cshannon
Copy link
Contributor

cshannon commented May 3, 2024

This might work but definitely needs some experimenting as you said. I'm thinking through this and a concern I have with this approach is you might end up with similar churn as depending on the table balancer algorithm (whether it's the default SimpleLoadBalancer or something custom) and you could end up having to just repeatedly re-balance the same tables but of course that's probably the case today as well.

I was trying to think through a worst case example, let's say you have a few tables and the balancing starts.

  1. Table A would be processed and find migrations so it would stop and balance A.
  2. It would run again and skip A and then go to table B and migrate Table B and stop.
  3. Third run starts and now table A needs balancing again after migrations from table B, so we rebalance A and start moving a subset of the same tablets we just moved and we stop.
  4. We run again and now A and B are ok but C needs balancing based on the table balancer, so it balances and stops.
  5. The next run now triggers a to need balancing again based on C moving and so forth.

How bad the churn is will depend on how likely the table balancing algorithm will re-trigger tables to rebalance again based on the future tables in the sorted order being balanced. The default SimpleLoadBalancer may not be that bad and this might work ok since it's just trying to spread things evenly so I think this is worth a try, but a custom balancer or other instances could be worse. Of course if someone wants to use a custom balancer then they would just keep this single table feature turned off.

I think an interesting approach that would be nice if it wasn't too hard to implement would be to try and run the algorithm as a dry run without actually performing migrations and just gathering the list of migrations. Then you could take that set of migrations and run the algorithm again. You could have it configurable to run through some number of iterations before actually performing work. This may be a bigger change though and not sure what other consequences would occur but could also be worth a shot.

@cshannon cshannon self-assigned this May 3, 2024
@EdColeman
Copy link
Contributor

I was wondering if instead of going in a predictable order if the algorithm picked a random start point? Maybe treat the list of tables (by id?) as a ring buffer, and then start somewhere random in the buffer and walk-through the candidates until you reach the start point? This may help preventing getting stuck on tables that sort first.

An alternative may be to gather a count of all tables that need migration and then work them off in order, starting with the largest counts. Would this help it converge faster?

@cshannon
Copy link
Contributor

cshannon commented May 3, 2024

I'm working on a draft of this change now to use for testing.

@EdColeman
Copy link
Contributor

Another "feature" may be to prioritize system tables - particularly the metadata table to make sure that the system tables are balanced first.

@cshannon
Copy link
Contributor

cshannon commented May 3, 2024

@EdColeman - Those are some interesting ideas as well, I think this may be something where we just need to test a couple approaches and see what works the best

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement This issue describes a new feature, improvement, or optimization.
Projects
None yet
Development

No branches or pull requests

3 participants