Determine if scan server tablet metadata cache blocks on refresh #4544

keith-turner · 2024-05-10T14:39:39Z

The ScanServer has a cache of tablet metadata with an expire after write time. Whan a tablet is constantly being read from this cache, what happens when the expiration time arrives? Does it drop it cause a block read for the next cache use? Ideally items in the cache that were recently used could be reloaded in such a way that the expiration does not cause blocking for the reload, but not sure if this is possible w/ caffine.

The first use of tablet should cause a blocking read on the cache, but if the tablet continues to be used frequently it would be nice if its refreshed in the background somehow w/o causing further blocking reads form the cache.

EdColeman · 2024-05-10T14:45:27Z

Maybe the refreshAfterWrite could be used? https://github.com/ben-manes/caffeine/wiki/Refresh

cshannon · 2024-05-10T14:45:38Z

I will look into this to see how the Cache is being used but Caffeine does support a refresh option which allows an updated value to be loaded in the background and the old value is returned until complete. We probably need to change the configuration here to use the refresh

cshannon · 2024-05-10T14:46:23Z

@EdColeman - I see you just beat me to the same comment :)

keith-turner · 2024-05-10T14:55:21Z

I did not know about the refresh capability, that seems like it could really smooth out scan server performance for frequently used tablets. Also, that seems to indicate that the current code is probably blocking on refreshing frequently used tablets.

cshannon · 2024-05-10T14:59:36Z

It seems like we may be able to target the changes here for 2.1 or at least for 3.1

keith-turner · 2024-05-10T15:02:15Z

It seems like we may be able to target the changes here for 2.1 or at least for 3.1

It could be a nice change for 2.1, it definitely seems like a performance bug if its periodically blocking.

keith-turner · 2024-05-10T15:21:51Z

We may be able to use this refresh capability to further smooth out performance of frequently used tablets. Looking at the caffeine refresh docs, it seems custom code can be called on refresh. We could add code there to write out any new scan server refs if needed. That could be follow on work. It would further reduce delays around frequently used tablets. That could look like the following.

Load tablet metadata for table A
Tablet A uses files F1,F2,F3 so write out scan server refs for those
Later refresh tablet A and find it uses files F1,F2,F3,F4.. as part of refresh write out a new scans server ref for F4

ben-manes · 2024-05-11T03:57:49Z

From afar, it does sound like refreshAfterWrite could work very well for you. It avoids the latency spike when popular items expire by reloading early, while allowing seldomly used ones to fade away by expiring. It is intended to hide latency.

You can override CacheLoader.reload(key, value) if you can use the value as part of your logic. This is useful when you can cheaply validate if the current value is up to date, e.g. a version / timestamp, or if you can apply a differential update rather than a full reload.

As reloads are triggered by individual lookups, it can be beneficial to batch these requests into a single load. A small buffering delay can be used to accumulate reloads over a time/space window, perform the bulk load, and complete the individual operations. This works nicely because that buffering delay is not application-facing, as it continues to receive the current value while the reload is in progress. The example uses Reactor for simplicity as it then only requires about a dozen lines of code.

Hope that helps!

cshannon · 2024-05-11T13:29:21Z

@ben-manes - Thanks for the input, that helps. One question I was wondering is that currently this uses a LoadingCache vs AsyncLoadingCache and I was wondering if there would be any reason to switch and any difference with how the refresh behavior works. The refresh operation looks like it will be done in the background even for a LoadingCache and the old value returned, so I think the current LoadingCache is ok in this case.

I think the main benefit of the AsyncLoadingCache is when the computation is slow for the load so you reduce contention on the cache when other threads are hitting it since you put the loads onto an executor. So for the sync cache I assume the performance is based on how fine grained the locking is when running computeIfAbsent(). In this case I'm not sure how much of a benefit that would be, I would need to look into it more.

This adds a property to configure the scan server tablet metadata Caffeine cache to refresh cached tablet metadata in the background on cache hits after the refresh time has passed. The refresh time is expressed as a percentage of the expiration time. This allows the cached entries to refresh before expiration if they are frequently used so that scans will not be blocked waiting on a refresh on expiration. Entries still expire if no cache hits come after the refresh time and expiration time passes. See: https://github.com/ben-manes/caffeine/wiki/Refresh This closes apache#4544

ben-manes · 2024-05-11T17:05:03Z

@cshannon, I think you're right on all points. I'll give more depth just for fun.

In regards to automatic refresh, the sync/async caches are the same. The vast majority of the code is shared (BoundedLocalCache.refreshIfNeeded) with thin adapters to the corresponding interfaces (LocalLoadingCache, LocalAsyncLoadingCache). When a refresh is triggered, it calls AsyncCacheLoader.asyncReload(k, v, executor), which CacheLoader overrides to wrap and call the synchronous reload(k, v) for convenience. There is a little juggling if the cache is storing a future value, but that's all internal details.

AsyncLoadingCache is definitely useful when avoiding lock contention on computeIfAbsent, as it shifts from ConcurrentHashMap's hashbin lock to the entry's future. That reduces the map's own operation time, but sacrifices linearizability which is normally okay. By the cache managing the pending value, another significant benefit is it allows us to offer a smarter bulk load that protects against cache stampedes like individual loads do. Other benefits are that a future also allows the caller to use a timeout or share the same error if a failure rather than trying the load anew. And of course if the caller or loader is an async/reactive chain then it naturally fits together rather than randomly blocking in an undesirable code point. That's quite nice if all calls can benefit from coalescing and not just a refresh.

A negative aspect of all async calls is the need for an executor. The cache doesn't manage its own threads so it defaults to the shared JVM-wide ForkJoinPool.commonPool() (CompletableFuture does similarly in its defaultExecutor()). That thread pool is designed for cpu-intensive work and will quickly starve on I/O due to its limited thread count. The new virtual thread executor is optimized for I/O as it elegantly layers itself on top of FJP so the blocking waits are not performed on the native threads. That feature is still too immature for production use on the latest JDKs due to footguns, but those will be resolved in future releases. When appropriate we'll switch defaults, but you might see a benefit in sharing an application-wide Executors.newCachedThreadPool() for all of your async I/O (ThreadPoolExecutor) and configuring that in Caffeine when appropriate.

cshannon · 2024-05-11T17:17:12Z

@ben-manes - Thanks for the explanation, that helps a lot. Sounds like the Virtual thread support will be quite nice when it's ready in the future.

cshannon self-assigned this May 10, 2024

cshannon mentioned this issue May 11, 2024

Enable background refresh for the scan server tablet metadata cache #4551

Merged

cshannon closed this as completed in 5d3ccec May 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Determine if scan server tablet metadata cache blocks on refresh #4544

Determine if scan server tablet metadata cache blocks on refresh #4544

keith-turner commented May 10, 2024

EdColeman commented May 10, 2024

cshannon commented May 10, 2024

cshannon commented May 10, 2024

keith-turner commented May 10, 2024

cshannon commented May 10, 2024

keith-turner commented May 10, 2024

keith-turner commented May 10, 2024

ben-manes commented May 11, 2024 •

edited

cshannon commented May 11, 2024

ben-manes commented May 11, 2024

cshannon commented May 11, 2024

Determine if scan server tablet metadata cache blocks on refresh #4544

Determine if scan server tablet metadata cache blocks on refresh #4544

Comments

keith-turner commented May 10, 2024

EdColeman commented May 10, 2024

cshannon commented May 10, 2024

cshannon commented May 10, 2024

keith-turner commented May 10, 2024

cshannon commented May 10, 2024

keith-turner commented May 10, 2024

keith-turner commented May 10, 2024

ben-manes commented May 11, 2024 • edited

cshannon commented May 11, 2024

ben-manes commented May 11, 2024

cshannon commented May 11, 2024

ben-manes commented May 11, 2024 •

edited