Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Determine if scan server tablet metadata cache blocks on refresh #4544

Closed
keith-turner opened this issue May 10, 2024 · 11 comments · Fixed by #4551
Closed

Determine if scan server tablet metadata cache blocks on refresh #4544

keith-turner opened this issue May 10, 2024 · 11 comments · Fixed by #4551
Assignees

Comments

@keith-turner
Copy link
Contributor

The ScanServer has a cache of tablet metadata with an expire after write time. Whan a tablet is constantly being read from this cache, what happens when the expiration time arrives? Does it drop it cause a block read for the next cache use? Ideally items in the cache that were recently used could be reloaded in such a way that the expiration does not cause blocking for the reload, but not sure if this is possible w/ caffine.

The first use of tablet should cause a blocking read on the cache, but if the tablet continues to be used frequently it would be nice if its refreshed in the background somehow w/o causing further blocking reads form the cache.

@cshannon cshannon self-assigned this May 10, 2024
@EdColeman
Copy link
Contributor

Maybe the refreshAfterWrite could be used? https://github.com/ben-manes/caffeine/wiki/Refresh

@cshannon
Copy link
Contributor

I will look into this to see how the Cache is being used but Caffeine does support a refresh option which allows an updated value to be loaded in the background and the old value is returned until complete. We probably need to change the configuration here to use the refresh

@cshannon
Copy link
Contributor

@EdColeman - I see you just beat me to the same comment :)

@keith-turner
Copy link
Contributor Author

I did not know about the refresh capability, that seems like it could really smooth out scan server performance for frequently used tablets. Also, that seems to indicate that the current code is probably blocking on refreshing frequently used tablets.

@cshannon
Copy link
Contributor

It seems like we may be able to target the changes here for 2.1 or at least for 3.1

@keith-turner
Copy link
Contributor Author

It seems like we may be able to target the changes here for 2.1 or at least for 3.1

It could be a nice change for 2.1, it definitely seems like a performance bug if its periodically blocking.

@keith-turner
Copy link
Contributor Author

We may be able to use this refresh capability to further smooth out performance of frequently used tablets. Looking at the caffeine refresh docs, it seems custom code can be called on refresh. We could add code there to write out any new scan server refs if needed. That could be follow on work. It would further reduce delays around frequently used tablets. That could look like the following.

  1. Load tablet metadata for table A
  2. Tablet A uses files F1,F2,F3 so write out scan server refs for those
  3. Later refresh tablet A and find it uses files F1,F2,F3,F4.. as part of refresh write out a new scans server ref for F4

@ben-manes
Copy link
Contributor

ben-manes commented May 11, 2024

From afar, it does sound like refreshAfterWrite could work very well for you. It avoids the latency spike when popular items expire by reloading early, while allowing seldomly used ones to fade away by expiring. It is intended to hide latency.

You can override CacheLoader.reload(key, value) if you can use the value as part of your logic. This is useful when you can cheaply validate if the current value is up to date, e.g. a version / timestamp, or if you can apply a differential update rather than a full reload.

As reloads are triggered by individual lookups, it can be beneficial to batch these requests into a single load. A small buffering delay can be used to accumulate reloads over a time/space window, perform the bulk load, and complete the individual operations. This works nicely because that buffering delay is not application-facing, as it continues to receive the current value while the reload is in progress. The example uses Reactor for simplicity as it then only requires about a dozen lines of code.

Hope that helps!

@cshannon
Copy link
Contributor

@ben-manes - Thanks for the input, that helps. One question I was wondering is that currently this uses a LoadingCache vs AsyncLoadingCache and I was wondering if there would be any reason to switch and any difference with how the refresh behavior works. The refresh operation looks like it will be done in the background even for a LoadingCache and the old value returned, so I think the current LoadingCache is ok in this case.

I think the main benefit of the AsyncLoadingCache is when the computation is slow for the load so you reduce contention on the cache when other threads are hitting it since you put the loads onto an executor. So for the sync cache I assume the performance is based on how fine grained the locking is when running computeIfAbsent(). In this case I'm not sure how much of a benefit that would be, I would need to look into it more.

cshannon added a commit to cshannon/accumulo that referenced this issue May 11, 2024
This adds a property to configure the scan server tablet metadata
Caffeine cache to refresh cached tablet metadata in the background on
cache hits after the refresh time has passed. The refresh time is
expressed as a percentage of the expiration time. This allows the cached
entries to refresh before expiration if they are frequently used so that
scans will not be blocked waiting on a refresh on expiration. Entries
still expire if no cache hits come after the refresh time and expiration
time passes.

See: https://github.com/ben-manes/caffeine/wiki/Refresh

This closes apache#4544
cshannon added a commit to cshannon/accumulo that referenced this issue May 11, 2024
This adds a property to configure the scan server tablet metadata
Caffeine cache to refresh cached tablet metadata in the background on
cache hits after the refresh time has passed. The refresh time is
expressed as a percentage of the expiration time. This allows the cached
entries to refresh before expiration if they are frequently used so that
scans will not be blocked waiting on a refresh on expiration. Entries
still expire if no cache hits come after the refresh time and expiration
time passes.

See: https://github.com/ben-manes/caffeine/wiki/Refresh

This closes apache#4544
@ben-manes
Copy link
Contributor

@cshannon, I think you're right on all points. I'll give more depth just for fun.

In regards to automatic refresh, the sync/async caches are the same. The vast majority of the code is shared (BoundedLocalCache.refreshIfNeeded) with thin adapters to the corresponding interfaces (LocalLoadingCache, LocalAsyncLoadingCache). When a refresh is triggered, it calls AsyncCacheLoader.asyncReload(k, v, executor), which CacheLoader overrides to wrap and call the synchronous reload(k, v) for convenience. There is a little juggling if the cache is storing a future value, but that's all internal details.

AsyncLoadingCache is definitely useful when avoiding lock contention on computeIfAbsent, as it shifts from ConcurrentHashMap's hashbin lock to the entry's future. That reduces the map's own operation time, but sacrifices linearizability which is normally okay. By the cache managing the pending value, another significant benefit is it allows us to offer a smarter bulk load that protects against cache stampedes like individual loads do. Other benefits are that a future also allows the caller to use a timeout or share the same error if a failure rather than trying the load anew. And of course if the caller or loader is an async/reactive chain then it naturally fits together rather than randomly blocking in an undesirable code point. That's quite nice if all calls can benefit from coalescing and not just a refresh.

A negative aspect of all async calls is the need for an executor. The cache doesn't manage its own threads so it defaults to the shared JVM-wide ForkJoinPool.commonPool() (CompletableFuture does similarly in its defaultExecutor()). That thread pool is designed for cpu-intensive work and will quickly starve on I/O due to its limited thread count. The new virtual thread executor is optimized for I/O as it elegantly layers itself on top of FJP so the blocking waits are not performed on the native threads. That feature is still too immature for production use on the latest JDKs due to footguns, but those will be resolved in future releases. When appropriate we'll switch defaults, but you might see a benefit in sharing an application-wide Executors.newCachedThreadPool() for all of your async I/O (ThreadPoolExecutor) and configuring that in Caffeine when appropriate.

@cshannon
Copy link
Contributor

@ben-manes - Thanks for the explanation, that helps a lot. Sounds like the Virtual thread support will be quite nice when it's ready in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

4 participants