Introduce a thread-local storage variable to replace the atomic when updating used_memory metrics, to reduce contention. #308
base: unstable
Changes from 2 commits
```diff
@@ -52,6 +52,7 @@ void zlibc_free(void *ptr) {
 #include <string.h>
 #include "zmalloc.h"
 #include "atomicvar.h"
+#include "serverassert.h"

 #define UNUSED(x) ((void)(x))
```
```diff
@@ -87,10 +88,23 @@ void zlibc_free(void *ptr) {
 #define dallocx(ptr,flags) je_dallocx(ptr,flags)
 #endif

-#define update_zmalloc_stat_alloc(__n) atomicIncr(used_memory,(__n))
-#define update_zmalloc_stat_free(__n) atomicDecr(used_memory,(__n))
+#define update_zmalloc_stat_alloc(__n) used_memory_thread[thread_index] += (__n)
+#define update_zmalloc_stat_free(__n) used_memory_thread[thread_index] -= (__n)

+/* A thread-local variable that keeps the current thread's index into the
+ * used_memory_thread array. */
+static __thread int thread_index;
+/* MAX_THREADS_NUM = IO_THREADS_MAX_NUM (128) + BIO threads (3) + main thread (1). */
+#define MAX_THREADS_NUM 132
```
Review comment: Can we do something like this? I think we need to introduce a new define for the BIO thread count; then we don't have to worry about this changing.

Review comment: Yeah, there is a layering concern here. Can you add a

Review comment: By the way, if we go with this route, we should probably introduce a new header to host all thread-local-storage-related definitions. It feels a bit out of place to define
```diff
+static unsigned long long used_memory_thread[MAX_THREADS_NUM];

 static serverAtomic size_t used_memory = 0;
+static serverAtomic int total_active_threads;

+/* Register the thread index in start_routine. */
+void zmalloc_register_thread_index(void) {
+    atomicGetIncr(total_active_threads, thread_index, 1);
```
Review comment: Since both … Strictly speaking though (if …
||||||||||||||
assert(total_active_threads < MAX_THREADS_NUM); | ||||||||||||||
} | ||||||||||||||
|
||||||||||||||
static void zmalloc_default_oom(size_t size) { | ||||||||||||||
fprintf(stderr, "zmalloc: Out of memory trying to allocate %zu bytes\n", | ||||||||||||||
|
```diff
@@ -409,8 +423,11 @@ char *zstrdup(const char *s) {
 }

 size_t zmalloc_used_memory(void) {
-    size_t um;
-    atomicGet(used_memory,um);
+    assert(total_active_threads < MAX_THREADS_NUM);
+    size_t um = 0;
+    for (int i = 0; i < total_active_threads; i++) {
```
Review comment: Wondering how important it is for zmalloc_used_memory to return the correct used_memory? As we loop through to add up the memory used by each thread, the threads that have already been counted may continue to allocate or free memory, so the sum might under- or over-account the actual memory used by all threads. I don't think this is an issue if the function is only used for metrics, but overMaxmemoryAfterAlloc (evict.c) is not for metrics; we should probably ensure (and call out) that undercounting memory usage would not be an issue there.

Author reply: @qlong I understand your concern. At the beginning I wanted to keep an approximate value for used_memory: each thread keeps its own used_memory_thread, and only when used_memory_thread grows beyond a threshold (e.g. 1 KB) is it flushed to the global atomic used_memory. The pseudocode would look like this, but the test case at https://github.com/valkey-io/valkey/blob/unstable/src/zmalloc.c#L917 seems to need an accurate used_memory. Back to this PR: each element in the used_memory_thread array has only one writer; a reader may see either the old value or the new value.

```diff
-#define update_zmalloc_stat_alloc(__n) atomicIncr(used_memory,(__n))
-#define update_zmalloc_stat_free(__n) atomicDecr(used_memory,(__n))
+#define update_zmalloc_stat_alloc(__n) do { \
+    used_memory_thread += (__n); \
+    if (used_memory_thread > 1024) atomicIncr(used_memory, used_memory_thread); \
+} while (0)
+
+#define update_zmalloc_stat_free(__n) do { \
+    used_memory_thread -= (__n); \
+    if (used_memory_thread > 1024) atomicDecr(used_memory, used_memory_thread); \
+} while (0)

 static serverAtomic size_t used_memory = 0;
+static __thread size_t used_memory_thread = 0;
```

Review comment: a few quick thoughts
Author reply: Seems we all have concerns about the absolute accuracy of used_memory. The caller in evict.c:

```c
int overMaxmemoryAfterAlloc(size_t moremem) {
    if (!server.maxmemory) return 0; /* No limit. */

    /* Check quickly. */
    size_t mem_used = zmalloc_used_memory();
    if (mem_used + moremem <= server.maxmemory) return 0;

    size_t overhead = freeMemoryGetNotCountedMemory();
    mem_used = (mem_used > overhead) ? mem_used - overhead : 0;
    return mem_used + moremem > server.maxmemory;
}
```

Author reply: Kindly ping @PingXie, any comments about this? Do we need to continue this?

Review comment: I think there is a way to address this concern while retaining most of the performance benefit if we keep the existing io-threading model. The idea would be to keep a mirrored per-thread counter array, such that each thread periodically "flushes" its thread-local counter value to the corresponding element in the global per-thread counter array using atomic operations. This works with the current io-threading model, where io-threads and the main thread run in lock-step: every io-thread pushes its local value to the global array when it is done with its work, but only once, so by the time the main thread starts executing it has accurate data. This reduces the number of atomic operations to one per io-thread per epoll session. We would have to revisit this if we went for a different threading model (such as #22). That said, I am still not sure how accurate this information needs to be.

Author reply: IMO there is no meaningful use for an absolutely accurate used_memory. @PingXie, do you think we need to open an issue to discuss with the rest of @valkey-io/core-team whether an absolutely accurate used_memory is necessary?
```diff
+        um += used_memory_thread[i];
+    }
     return um;
 }
```
```diff
@@ -628,8 +645,6 @@ size_t zmalloc_get_rss(void) {

 #if defined(USE_JEMALLOC)

-#include "serverassert.h"
-
 #define STRINGIFY_(x) #x
 #define STRINGIFY(x) STRINGIFY_(x)
```
Review comment: thread_index is thread-local, but used_memory_thread[thread_index] is not. Without some sort of locking, a thread might read a strange value from used_memory_thread[thread_index] while another thread is updating it.

Author reply: What do you mean by "strange value"? The reader can only read old_value or new_value in this case.

Review comment: @lipzhu I meant a corrupted value due to a race condition among threads. I did not see (maybe I have missed it) how we guarantee the reader reads either the old or the new value. The reader can see a corrupted value of used_memory_thread[i] if the writer is updating used_memory_thread[i] at the same time.

Author reply: Yes, the semantics are a little different from the original implementation, but do we really need such an accurate used_memory? The cost is adding the expensive lock add/sub instructions to low-level APIs like the zmalloc/zfree functions, especially since most calls to these functions come from the main thread. And the new_value will be seen on the next read.

Review comment: @lipzhu I think data corruption caused by a race condition is not a matter of accuracy: the corrupted value can be anything from 0 to 2^64-1, which can lead to correctness or stability issues. The original implementation uses atomic numbers, which leverage CPU support and are quite efficient for a counter that is updated frequently from different threads.