
Introduce Thread-local storage variable to replace atomic when update used_memory metrics to reduce contention. #308

Open · wants to merge 3 commits into base: unstable

Conversation

@lipzhu (Contributor) commented Apr 12, 2024

Description

This patch introduces a thread-local storage (TLS) variable to replace the atomic counter used by zmalloc, reducing unnecessary contention.

Problem Statement

The zmalloc and zfree functions update used_memory on every operation, and they are called very frequently. In the memtier_benchmark-1Mkeys-load-stream-5-fields-with-100B-values-pipeline-10.yml benchmark, the cycle ratio of zmalloc and zfree is high even though they are thin wrappers around the underlying allocator and should not cost many cycles. Most of those cycles come from lock add and lock sub, which are expensive instructions. Profiling shows the metric updates come mainly from the main thread, so using TLS removes a lot of the contention.
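As a rough sketch of the approach (names taken from the diff fragments quoted later in this review; this is a simplified reconstruction, not the exact patch):

```c
#include <stddef.h>

/* IO_THREADS_MAX_NUM(128) + BIO threads(3) + main thread(1), per the diff. */
#define MAX_THREADS_NUM 132

static size_t used_memory_tls[MAX_THREADS_NUM]; /* one slot per thread */
static __thread int thread_index;               /* this thread's slot */

/* Hot path: plain add/sub on this thread's own slot -- no lock prefix. */
#define update_zmalloc_stat_alloc(__n) (used_memory_tls[thread_index] += (__n))
#define update_zmalloc_stat_free(__n)  (used_memory_tls[thread_index] -= (__n))

/* Read path: sum the slots; the result is an approximate snapshot. */
size_t zmalloc_used_memory(void) {
    size_t um = 0;
    for (int i = 0; i < MAX_THREADS_NUM; i++) um += used_memory_tls[i];
    return um;
}
```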


Performance Impact

Test Env
  • OS: CentOS Stream 8
  • Kernel: 6.2.0
  • Platform: Intel Xeon Platinum 8380
  • Server and Client in same socket
Start Server

taskset -c 0 ~/valkey/src/valkey-server /tmp/valkey_1.conf

with /tmp/valkey_1.conf containing:

port 9001
bind * -::*
daemonize yes
protected-mode no
save ""

Using the benchmark memtier_benchmark-1Mkeys-load-stream-5-fields-with-100B-values-pipeline-10.yml:
memtier_benchmark -s 127.0.0.1 -p 9001 "--pipeline" "10" "--data-size" "100" --command "XADD __key__ MAXLEN ~ 1 * field __data__" --command-key-pattern="P" --key-minimum=1 --key-maximum 1000000 --test-time 180 -c 50 -t 4 --hide-histogram

We can observe more than 6% QPS gain.

For the other benchmarks (SET/GET), using commands like:
taskset -c 6-9 ~/valkey/src/valkey-benchmark -p 9001 -t set,get -d 100 -r 1000000 -n 1000000 -c 50 --threads 4

there is no performance gain, but also no regression.

With pipelining enabled, I observe a 4% performance gain with this test case:
taskset -c 4-7 memtier_benchmark -s 127.0.0.1 -p 9001 "--pipeline" "10" "--data-size" "100" --ratio 1:0 --key-pattern P:P --key-minimum=1 --key-maximum 1000000 --test-time 180 -c 50 -t 4 --hide-histogram

@lipzhu changed the title from "Introduce TLS variable to replace atomic when update used_memory metrics to reduce contention." to "Introduce Thread-local storage variable to replace atomic when update used_memory metrics to reduce contention." on Apr 12, 2024
@madolson (Member) left a comment:

What are the server configurations you are using for the test? I didn't see them listed. Can you also just do a simple test with get/set?

src/zmalloc.c (outdated; conversation resolved)
* used_memory_tls array. */
static __thread int thread_index;
/* MAX_THREADS_NUM = IO_THREADS_MAX_NUM(128) + BIO threads(3) + main thread(1). */
#define MAX_THREADS_NUM 132
Member commented:

Suggested change:
-#define MAX_THREADS_NUM 132
+#define MAX_THREADS_NUM (IO_THREADS_MAX_NUM + BIO_THREAD_COUNT + 1)

Can we do something like this? I think we need to implement a new define for BIO thread count. Then we don't have to worry about this changing.

@lipzhu (Contributor, Author) replied:

IO_THREADS_MAX_NUM is defined in networking.c, and BIO_THREAD_COUNT has a similar define, BIO_WORKER_NUM, in bio.c. How do we reuse those defines in zmalloc.c? It is a very low-level API.

Member replied:

Yeah, there is a layering concern here. Can you add a c_assert to make sure the condition (IO_THREADS_MAX_NUM + BIO_THREAD_COUNT + 1) <= MAX_THREADS_NUM is not violated?
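For illustration, a C11 static assertion expressing that check might look like the following (assuming the two constants are made visible to zmalloc.c; the message string is illustrative):

```c
_Static_assert(IO_THREADS_MAX_NUM + BIO_THREAD_COUNT + 1 <= MAX_THREADS_NUM,
               "MAX_THREADS_NUM must cover io threads + bio threads + main");
```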

Member commented:

By the way, if we go this route, we should probably introduce a new header to host all thread-local-storage-related definitions. It feels a bit out of place to define MAX_THREADS_NUM in zmalloc.c, and we might have other consumers of this macro in the future. I wonder if we should call the new header system.h, which would sit at the lowest layer.
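For illustration, such a header might look like the sketch below; the file name, the BIO_THREAD_COUNT define, and the exact layering are all hypothetical, per the discussion above.

```c
/* Hypothetical src/system.h: lowest-layer definitions shared by
 * networking.c, bio.c, and zmalloc.c. Names are illustrative only. */
#ifndef SYSTEM_H
#define SYSTEM_H

#define IO_THREADS_MAX_NUM 128 /* today defined in networking.c */
#define BIO_THREAD_COUNT 3     /* would mirror BIO_WORKER_NUM in bio.c */
#define MAX_THREADS_NUM (IO_THREADS_MAX_NUM + BIO_THREAD_COUNT + 1)

#endif /* SYSTEM_H */
```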

src/zmalloc.c (outdated)
-atomicGet(used_memory,um);
+size_t um = 0;
+for (int i = 0; i < total_active_threads; i++) {
+    um += used_memory_tls[i];
Member commented:

Suggested change:
-um += used_memory_tls[i];
+serverAssert(i < MAX_THREADS_NUM);
+um += used_memory_tls[i];


/* Register the thread index in start_routine. */
void zmalloc_register_thread_index(void) {
atomicGetIncr(total_active_threads, thread_index, 1);
Member commented:

Suggested change:
-atomicGetIncr(total_active_threads, thread_index, 1);
+serverAssert(total_active_threads < MAX_THREADS_NUM);
+atomicGetIncr(total_active_threads, thread_index, 1);

Member commented:

Since both total_active_threads and thread_index are of size_t, which matches the native word size of the platform, the proposed change should work.

Strictly speaking, though (if size_t were 64-bit on a 32-bit platform), L112 and L113 would need to be flipped. We also shouldn't touch the global variable total_active_threads without an atomic operation; instead, we should read the thread-local variable thread_index.

Suggested change:
-atomicGetIncr(total_active_threads, thread_index, 1);
+atomicGetIncr(total_active_threads, thread_index, 1);
+serverAssert(thread_index < MAX_THREADS_NUM);

src/zmalloc.c (outdated; conversation resolved)
@lipzhu (Contributor, Author) commented Apr 15, 2024:

> What are the server configurations you are using for the test? I didn't see them listed. Can you also just do a simple test with get/set?

Just updated the server config and the SET/GET results in the top comment.

-atomicGet(used_memory,um);
+assert(total_active_threads < MAX_THREADS_NUM);
+size_t um = 0;
+for (int i = 0; i < total_active_threads; i++) {
@qlong commented Apr 15, 2024:

Wondering how important it is for zmalloc_used_memory to return the correct used_memory? As we loop through to add up the memory used by each thread, threads that have already been counted may continue to allocate or free memory, so the sum might under- or over-account the actual memory used by all threads.

I don't think this is an issue if the function is only used for metrics, but overMaxmemoryAfterAlloc (evict.c) is not for metrics; we should probably ensure (and call out) that undercounting memory usage would not be an issue there.

@lipzhu (Contributor, Author) replied:

@qlong I understand your concern. At the beginning I wanted to keep only an approximate value for used_memory: each thread keeps its own used_memory_thread, and only when used_memory_thread exceeds a threshold (e.g. 1 KB) does it flush into the global atomic used_memory; the pseudocode is sketched below. But I saw the test case at https://github.com/valkey-io/valkey/blob/unstable/src/zmalloc.c#L917, which seems to need an accurate used_memory? Back to this PR: each element in the used_memory_thread array has only one writer; a reader may get either the old value or the new value, depending on the read timing.
@madolson What are your thoughts?

-#define update_zmalloc_stat_alloc(__n) atomicIncr(used_memory,(__n))
-#define update_zmalloc_stat_free(__n) atomicDecr(used_memory,(__n))
+/* Accumulate locally; flush into the global atomic once the local
+ * delta exceeds a threshold, then reset the local counter. */
+#define update_zmalloc_stat_alloc(__n) do { \
+    used_memory_thread += (__n); \
+    if (used_memory_thread > 1024) { \
+        atomicIncr(used_memory, used_memory_thread); \
+        used_memory_thread = 0; \
+    } \
+} while (0)
+
+#define update_zmalloc_stat_free(__n) do { \
+    used_memory_thread -= (__n); \
+    if (used_memory_thread < -1024) { \
+        atomicDecr(used_memory, (size_t)-used_memory_thread); \
+        used_memory_thread = 0; \
+    } \
+} while (0)

 static serverAtomic size_t used_memory = 0;
+/* Signed, because a thread may free more than it allocated. */
+static __thread ssize_t used_memory_thread = 0;

Member commented:

A few quick thoughts:

  1. I see value in removing the atomic operations.
  2. Since we use size_t for the counters, which is naturally aligned, the worst we can get is staleness as opposed to inconsistency.
  3. Absolute accuracy doesn't have much value for metrics. The value goes stale immediately after the INFO command returns, whether it is absolutely accurate or not. As long as the value is close enough, I think we are good. We can change the tests.
  4. Agreed that overMaxmemoryAfterAlloc sets a higher bar than metrics. My gut feeling is that the staleness introduced by relaxed memory access should still be negligible; I doubt we could accumulate many changes before the CPU flushes its cache to main memory. Curious to know if anyone has more definitive information.
  5. The absolute-accuracy requirement is directly at odds with the performance overhead incurred by the atomic operations, so one of them has to go.

@lipzhu (Contributor, Author) commented Apr 30, 2024:

It seems we all have concerns about the absolute accuracy of used_memory in overMaxmemoryAfterAlloc. IMO this is not a problem: by the definition of overMaxmemoryAfterAlloc, the comparison between moremem and used_memory (their semantics differ) is already not absolutely accurate once you consider the gap between the requested memory and the memory actually allocated by the allocator.

int overMaxmemoryAfterAlloc(size_t moremem) {
    if (!server.maxmemory) return 0; /* No limit. */

    /* Check quickly. */
    size_t mem_used = zmalloc_used_memory();
    if (mem_used + moremem <= server.maxmemory) return 0;

    size_t overhead = freeMemoryGetNotCountedMemory();
    mem_used = (mem_used > overhead) ? mem_used - overhead : 0;
    return mem_used + moremem > server.maxmemory;
}

@lipzhu (Contributor, Author) commented:

Kindly ping @PingXie. Any comments on this? Should we continue with it?

Member replied:

I think there is a way to address this concern while retaining most of the performance benefit, if we keep the existing io-threading model. The idea would be to keep a mirrored per-thread counter array, such that each thread periodically "flushes" its thread-local counter value to the corresponding element in the global per-thread counter array using atomic operations. This works with the current io-threading model, where io-threads and the main thread run in lock-step: every io-thread pushes its local value to this global array when it finishes its work, but only once. By the time the main thread starts executing, it has accurate data. This reduces the number of atomic operations to one per io-thread per epoll session.

We would have to revisit this if we went for a different threading model (such as #22).

That said, I am still not sure how accurate this information needs to be. zmalloc_used_memory doesn't represent the RSS usage of the process anyway. I think it would be really helpful if we could somehow establish the worst possible case, empirically or even as a SWAG.
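A rough sketch of that mirrored-counter idea, with assumed names (used_memory_mirror and zmalloc_flush_thread_stats are illustrative, not from the patch):

```c
#include <stdatomic.h>
#include <stddef.h>

#define MAX_THREADS_NUM 132

/* Global mirror: one atomically-published slot per thread. */
static _Atomic size_t used_memory_mirror[MAX_THREADS_NUM];

/* Hot path stays on plain thread-locals: no atomics per zmalloc/zfree. */
static __thread size_t used_memory_local;
static __thread int thread_index;

/* One atomic store per io-thread per epoll session, when its work ends. */
static inline void zmalloc_flush_thread_stats(void) {
    atomic_store_explicit(&used_memory_mirror[thread_index],
                          used_memory_local, memory_order_relaxed);
}

/* Main thread: sums the mirrors; exact once all io-threads have flushed. */
static size_t zmalloc_used_memory_sum(void) {
    size_t sum = 0;
    for (int i = 0; i < MAX_THREADS_NUM; i++)
        sum += atomic_load_explicit(&used_memory_mirror[i],
                                    memory_order_relaxed);
    return sum; /* unsigned wraparound keeps the cross-thread sum correct */
}
```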

@lipzhu (Contributor, Author) commented May 7, 2024:

> That said, I am still not sure how accurate this information needs to be. zmalloc_used_memory doesn't represent the RSS usage of the process anyways.

IMO an absolutely accurate zmalloc_used_memory is not meaningful, and the cost of achieving it is very high: not only the expensive atomic operation in every zmalloc/zfree call, but also the cost of je_malloc_usable_size (in the cycle-hotspot snapshot in the top comment, je_malloc_usable_size is the #1 hotspot). If we can accept slightly stale zmalloc_used_memory values, ideally we could see at least a 10% performance gain.

@PingXie Do you think we should open an issue to discuss with the rest of @valkey-io/core-team whether an absolutely accurate zmalloc_used_memory is necessary?

Member replied:

I saw you opened #467. Thanks @lipzhu.

src/zmalloc.c (outdated)
@@ -87,10 +88,23 @@ void zlibc_free(void *ptr) {
#define dallocx(ptr,flags) je_dallocx(ptr,flags)
#endif

-#define update_zmalloc_stat_alloc(__n) atomicIncr(used_memory,(__n))
-#define update_zmalloc_stat_free(__n) atomicDecr(used_memory,(__n))
+#define update_zmalloc_stat_alloc(__n) used_memory_thread[thread_index] += (__n)
A reviewer commented:

thread_index is thread-local, but used_memory_thread[thread_index] is not. Without some sort of locking, a thread might get a strange value from used_memory_thread[thread_index] if another thread is updating it.

@lipzhu (Contributor, Author) replied:

What do you mean by a strange value? The reader can only read the old value or the new value in this case.

Reviewer replied:

@lipzhu I meant a corrupted value due to a race condition among threads. I did not see (maybe I missed it) how we guarantee that a reader reads either the old or the new value. The reader can see a corrupted value of used_memory_thread[i] if the writer is updating used_memory_thread[i] at the same time.

@lipzhu (Contributor, Author) replied:

Yes, the semantics are a little different from the original implementation, but do we really need such an accurate used_memory? The cost is adding expensive lock add/lock sub instructions to low-level APIs like the zmalloc/zfree functions, especially since most calls to these functions come from the main thread. And the new value will be seen on the next read.

Reviewer replied:

@lipzhu I think data corruption caused by a race condition is not about accuracy; the corrupted value can be anything from 0 to 2^64-1. This can lead to correctness or stability issues.

The original implementation uses an atomic counter, which leverages CPU support and is quite efficient for a counter that is updated frequently from different threads.
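If torn reads are the sticking point, one possible middle ground (a sketch, not the PR's code) is C11 relaxed atomics: on mainstream 64-bit targets a relaxed atomic load or store compiles to a plain mov with no lock prefix, so the single-writer-per-slot scheme keeps its speed while the standard rules out torn values.

```c
#include <stdatomic.h>
#include <stddef.h>

#define MAX_THREADS_NUM 132

static _Atomic size_t used_memory_tls[MAX_THREADS_NUM];
static __thread int thread_index;

/* Only the owning thread writes its slot, so a relaxed load/store pair
 * is race-free; readers on other threads never observe a torn value. */
static inline void update_stat_alloc(size_t n) {
    size_t cur = atomic_load_explicit(&used_memory_tls[thread_index],
                                      memory_order_relaxed);
    atomic_store_explicit(&used_memory_tls[thread_index], cur + n,
                          memory_order_relaxed);
}
```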

src/zmalloc.c (conversation resolved)
@lipzhu (Contributor, Author) commented Apr 17, 2024:

@valkey-io/core-team, could you help take a look at this patch?

@enjoy-binbin (Member) left a comment:

I did not take a deep look; how do we handle module threads?

@lipzhu (Contributor, Author) commented Apr 22, 2024:

Hi @enjoy-binbin, thanks for your comments on this patch. BTW, your blog on yuque helped me a lot in understanding redis/valkey.

> i did not take a deep look, how do we handle module threads?

Sorry, I missed the module threads. Maybe I can remove the explicit call to zmalloc_register_thread_index in the start routine, initialize static __thread int thread_index = -1, and then add a check if (thread_index == -1) zmalloc_register_thread_index() in zmalloc/zfree, as sketched below?
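A sketch of that lazy-registration idea (the helper name zmalloc_ensure_thread_index is illustrative; atomicGetIncr is shown here with a C11 equivalent):

```c
#include <stdatomic.h>

static atomic_int total_active_threads;
static __thread int thread_index = -1; /* -1 means "not registered yet" */

static void zmalloc_register_thread_index(void) {
    thread_index = atomic_fetch_add(&total_active_threads, 1);
}

/* Called at the top of zmalloc/zfree: module threads (or any thread not
 * spawned by the server itself) self-register on their first allocation. */
static inline void zmalloc_ensure_thread_index(void) {
    if (thread_index == -1) zmalloc_register_thread_index();
}
```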

codecov bot commented Apr 22, 2024:

Codecov Report

All modified and coverable lines are covered by tests ✅

❗ No coverage uploaded for pull request base (unstable@f4f1bd6).

Additional details and impacted files
@@             Coverage Diff             @@
##             unstable     #308   +/-   ##
===========================================
  Coverage            ?   68.41%           
===========================================
  Files               ?      108           
  Lines               ?    61559           
  Branches            ?        0           
===========================================
  Hits                ?    42115           
  Misses              ?    19444           
  Partials            ?        0           
Files Coverage Δ
src/zmalloc.c 83.71% <100.00%> (ø)

@lipzhu (Contributor, Author) commented Apr 28, 2024:

@enjoy-binbin @PingXie @madolson Thoughts on this patch?
