
Improve log recovery times #4505

Open
keith-turner opened this issue Apr 30, 2024 · 3 comments
Labels
enhancement This issue describes a new feature, improvement, or optimization.

Comments

@keith-turner
Contributor

Is your feature request related to a problem? Please describe.

Write ahead log recovery can take a while because of the following two behaviors.

  • Tablet server processes only do a single log recovery at a time
  • All tablets, even those with no data in the write ahead log, go through the log recovery process when being loaded on a tablet server.

Those behaviors make log recovery time correlate with the number of tablets per tserver: as the number of tablets per tserver increases, log recovery time increases.
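
A minimal sketch of the pattern described above, assuming a single recovery lock that serializes per-tablet recovery; this is illustrative only and not the actual Accumulo code path (the `Tablet` interface and `recoverFromSortedLogs` are placeholders):

```java
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

class SerializedRecoverySketch {
  private final ReentrantLock recoveryLock = new ReentrantLock();

  void loadTablets(List<Tablet> tabletsWithWalogs) {
    for (Tablet tablet : tabletsWithWalogs) {
      recoveryLock.lock();              // only one recovery runs at a time
      try {
        tablet.recoverFromSortedLogs(); // runs even when the logs hold no data for this tablet
      } finally {
        recoveryLock.unlock();
      }
    }
  }

  // Placeholder type for illustration only.
  interface Tablet {
    void recoverFromSortedLogs();
  }
}
```

With this shape, total recovery time grows roughly linearly with the number of tablets referencing walogs, which is the behavior this issue wants to avoid.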

Describe the solution you'd like

Allow parallel and faster log recovery. The parallelism is related to #4429, but that change does not completely solve the issue because the recovery lock is still acquired for log recovery.

  • Use a cache during log recovery when reading from sorted walog rfiles
  • Inspect tablets w/ logs before acquiring the recovery lock to see if they contain data (see the sketch after this list)
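
A rough sketch of the second bullet, assuming a hypothetical pre-check against the sorted walog rfiles; `RecoveryLog`, `KeyExtent`, and `hasDataForExtent` are illustrative placeholders, not the real Accumulo API:

```java
import java.util.List;

class RecoveryPreCheckSketch {

  void recoverIfNeeded(KeyExtent extent, List<RecoveryLog> sortedLogs,
      Runnable recoverUnderLock) {
    // Peek at the sorted walog rfiles before taking the recovery lock.
    boolean hasData = sortedLogs.stream().anyMatch(log -> log.hasDataForExtent(extent));
    if (!hasData) {
      // Nothing to replay for this tablet; skip the recovery lock entirely so
      // other tablet loads are not blocked behind an empty recovery.
      return;
    }
    recoverUnderLock.run(); // acquire the recovery lock and replay only when needed
  }

  // Placeholder types for illustration only.
  interface KeyExtent {}
  interface RecoveryLog {
    boolean hasDataForExtent(KeyExtent extent);
  }
}
```

The cache mentioned in the first bullet would sit behind the reads of those sorted walog rfiles, so that multiple tablets recovering from the same sorted logs do not repeatedly re-read the same blocks.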

Describe alternatives you've considered

Could potentially produce an F file for log recovery outside of the tablet server somewhere (similar to external compactions). This may have been discussed on an elasticity related issue, but I could not find it. This would be a much larger change and probably would be suitable to do in 2.1. It may require completely refactoring the tablet minor compaction code to make it usable elsewhere.

@keith-turner keith-turner added the enhancement This issue describes a new feature, improvement, or optimization. label Apr 30, 2024
@keith-turner keith-turner added this to To do in 2.1.3 via automation Apr 30, 2024
@keith-turner keith-turner added this to To do in 3.1.0 via automation Apr 30, 2024
@ctubbsii
Member

ctubbsii commented Apr 30, 2024

and probably would be suitable to do in 2.1

Did you mean "would not be"?

@dlmarion
Contributor

@keith-turner - you might be thinking of #4239 where I modified the code such that all Tablet Servers, Scan Servers, and Compactors participated in log recovery. I'm not sure if this is something that could be backported to an earlier version as it may depend on other changes in elasticity w/r/t tablet hosting and tablet management.

@keith-turner
Contributor Author

@keith-turner - you might be thinking of #4239 where I modified the code such that all Tablet Servers, Scan Servers, and Compactors participated in log recovery. I'm not sure if this is something that could be backported to an earlier version as it may depend on other changes in elasticity w/r/t tablet hosting and tablet management.

That change could speed up log sorting. The problem in this issue happens after the logs are sorted, when tablets w/ sorted walogs are loaded on a tablet server. Tablet servers only load one tablet w/ walogs at a time, which is what makes things slow.
