Tablet unload impacted by long-running compaction cancellation #4485

Open
dtspence opened this issue Apr 22, 2024 · 9 comments
Labels
bug This issue has been verified to be a bug.

Comments

@dtspence
Contributor

Describe the bug
A tablet unload (e.g. due to a migration request) may be delayed because the tablet cannot close until pending compaction cancellations finish. We have observed tablets waiting 50+ minutes for compactions to cancel.

Versions (OS, Maven, Java, and others, as appropriate):

  • Affected version(s) of this project: 2.1.2

To Reproduce
We are attempting to gather additional information to reproduce. Some preliminary information:

  • Tablets are being compacted by t-servers (i.e. not using external compactions).
  • Data being compacted is not expected to be filtered; however, we are unsure whether an iterator could be failing to return for an extended time.

Expected behavior
The migration request should complete within a much shorter time.

Screenshots
N/A

Additional context
The manager logs:

2024-04-22T16:23:58,434 [balancer.HostRegexTableLoadBalancer] WARN: Not balancing tables due to 1 outstanding migrations
@dtspence dtspence added the bug This issue has been verified to be a bug. label Apr 22, 2024
@dtspence dtspence changed the title Tablet unload may be impacted by long-running compaction cancellation Tablet unload impacted by long-running compaction cancellation Apr 22, 2024
@dlmarion
Contributor

@dtspence - is it the case that you manually cancelled the compactions? If so, did your command complete, or did it hang too?

@dlmarion
Contributor

The FileCompactor checks if the compaction is still enabled for every key that it writes. I'm curious if the compaction was making progress (you said filtering was not expected, which could also be a cause). Is this happening often? If not, is bouncing the tserver an option?
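
For reference, the per-key check described above boils down to a pattern like the following. This is an illustrative sketch only, not the actual FileCompactor code; the interfaces and names here are hypothetical:

```java
import java.io.BufferedWriter;
import java.io.IOException;

// Hypothetical sketch of a per-key cancellation check in a compaction write loop.
// NOT the actual org.apache.accumulo.server.compaction.FileCompactor.
interface CompactionEnv {
  boolean isCompactionEnabled(); // becomes false once the compaction is cancelled
}

interface KeyValueIterator {
  boolean hasTop();
  void next() throws IOException; // may block on HDFS reads or skip many filtered keys
  String topKey();
  String topValue();
}

class CompactionWriteLoopSketch {
  static void writeLoop(CompactionEnv env, KeyValueIterator source, BufferedWriter out)
      throws IOException {
    while (source.hasTop()) {
      // The cancellation flag is only observed here, between keys that get written.
      // If the iterator stack filters heavily or blocks waiting on HDFS, control may
      // not return to this check for a long time, which delays the cancellation and
      // therefore the tablet unload that is waiting on it.
      if (!env.isCompactionEnabled()) {
        throw new IOException("compaction cancelled");
      }
      out.write(source.topKey() + " -> " + source.topValue());
      out.newLine();
      source.next();
    }
  }
}
```

The point is that cancellation is only noticed between written keys, so anything that keeps the iterator from producing the next key also keeps the cancel from taking effect.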

@dtspence
Contributor Author

@dlmarion

is it the case that you manually cancelled the compactions? If so, did your command complete, or did it hang too?

No, the compaction eventually logs that it has canceled. We have not taken any manual action.

The FileCompactor checks if the compaction is still enabled for every key that it writes. I'm curious if the compaction was making progress (you said filtering was not expected, which could also be a cause). Is this happening often? If not, is bouncing the tserver an option?

Yes, the issue keeps reappearing. It does not appear to be localized to a single t-server. We have been wondering whether something tablet-related correlates with the issue. At least one tablet we were looking at was a hot spot and contained a lot of i-files from imports.

@dlmarion
Contributor

dlmarion commented Apr 23, 2024

@dtspence - are you seeing a lot of these messages?

@dtspence
Contributor Author

@dlmarion

are you seeing a lot of these messages?

No, we are not seeing the message above. Just for reference (it may already be known), we do see:

2024-04-23T17:58:17,298 [tserver.UnloadTabletHandler] DEBUG: Failed to unload tablet <tablet-name> ... it was already closing or closed

@dlmarion
Contributor

I think that log message might be from the Manager continuing to tell the TabletServer to unload the tablet.

@dlmarion
Contributor

I'm still thinking that maybe the compaction is not making progress. I don't think there is good logging for this with compactions that run in the Tablet Server. IIRC, the way to tell if it's making progress is to check out the output file for the compaction in HDFS and see if its size is increasing. If nothing is getting written to the file for a long time, then either it's filtering out a lot of data, or it's waiting on input from HDFS.
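
If it helps, a rough sketch of that progress check using the Hadoop FileSystem API is below. The path is a placeholder; point it at the compaction's output file in HDFS (for in-tserver compactions I believe this is written as a temporary file under the tablet's directory, but verify on your system):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Rough sketch: poll the size of the compaction's output file in HDFS to see
// whether it is still growing. The path argument is a placeholder.
public class CompactionProgressCheck {
  public static void main(String[] args) throws Exception {
    Path output = new Path(args[0]); // e.g. the compaction's temporary output file
    FileSystem fs = output.getFileSystem(new Configuration());
    long previous = -1;
    for (int i = 0; i < 20; i++) {
      long current = fs.getFileStatus(output).getLen();
      System.out.printf("size=%d bytes (delta=%d)%n",
          current, previous < 0 ? 0 : current - previous);
      previous = current;
      Thread.sleep(30_000); // sample every 30 seconds
    }
  }
}
```

The same check can be done from the command line with hdfs dfs -ls or hdfs dfs -du against the file.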

@ivakegg
Contributor

ivakegg commented Apr 29, 2024

I would love to have something in place to prevent a compaction from holding up the unloading of tablets. Is this something that is relatively easy to do? This would save us from long shutdowns as well.

@dlmarion
Contributor

Is this something that is relatively easy to do?

It's not a switch that exists today; we would need to develop and test a solution. If you can identify which compactions are causing a tablet not to close, you could run them as External Compactions. The existing code path does not wait for External Compactions to complete; it only waits for them if they are in the process of committing their changes to the tablet metadata.
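
For anyone who wants to try that route, a rough sketch of pointing a table at an external compaction queue in 2.1 is below. The property names are from memory of the external compaction documentation, so treat them as assumptions and verify against your version; a compaction coordinator and one or more compactor processes also need to be running.

```properties
# Sketch only - verify property names against the 2.1 docs before using.
# Define a compaction service whose executor is an external queue:
tserver.compaction.major.service.cs1.planner=org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner
tserver.compaction.major.service.cs1.planner.opts.executors=[{"name":"ext","type":"external","queue":"DCQ1"}]

# Point the affected table at that service:
table.compaction.dispatcher=org.apache.accumulo.core.spi.compaction.SimpleCompactionDispatcher
table.compaction.dispatcher.opts.service=cs1
```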
