Introduce a thread-local storage variable to replace the atomic when updating the used_memory metric, to reduce contention. #308

Open · wants to merge 3 commits into base: unstable
36 changes: 29 additions & 7 deletions src/zmalloc.c
@@ -52,6 +52,7 @@ void zlibc_free(void *ptr) {
#include <string.h>
#include "zmalloc.h"
#include "atomicvar.h"
#include "serverassert.h"

#define UNUSED(x) ((void)(x))

@@ -87,10 +88,30 @@ void zlibc_free(void *ptr) {
#define dallocx(ptr,flags) je_dallocx(ptr,flags)
#endif

#define update_zmalloc_stat_alloc(__n) atomicIncr(used_memory,(__n))
#define update_zmalloc_stat_free(__n) atomicDecr(used_memory,(__n))
#define update_zmalloc_stat_alloc(__n) do { \
if (unlikely(thread_index == -1)) zmalloc_register_thread_index(); \
used_memory_thread[thread_index] += (__n); \
} while (0)

#define update_zmalloc_stat_free(__n) do { \
if (unlikely(thread_index == -1)) zmalloc_register_thread_index(); \
used_memory_thread[thread_index] -= (__n); \
} while (0)

/* Thread-local storage that keeps the current thread's index into the
 * used_memory_thread array. */
static __thread int thread_index = -1;
/* MAX_THREADS_NUM = IO_THREADS_MAX_NUM(128) + BIO threads(3) + main thread(1). */
#define MAX_THREADS_NUM 132
Member:
Suggested change
#define MAX_THREADS_NUM 132
#define MAX_THREADS_NUM (IO_THREADS_MAX_NUM + BIO_THREAD_COUNT + 1)

Can we do something like this? I think we need to introduce a new define for the BIO thread count; then we don't have to worry about this value changing.

Contributor Author:

IO_THREADS_MAX_NUM is defined in networking.c, and BIO_THREAD_COUNT has a similar define, BIO_WORKER_NUM, in bio.c. How do we reuse those defines in zmalloc.c? It is a very low-level API.

Member:

Yeah, there is a layering concern here. Can you add a c_assert to make sure the condition (IO_THREADS_MAX_NUM + BIO_THREAD_COUNT + 1) <= MAX_THREADS_NUM is not violated?
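For reference, such a compile-time check can be expressed with C11's static_assert (a minimal sketch; c_assert here stands in for whatever static-assert helper the codebase exposes, and the constant values are taken from the comment in the diff above, so treat them as illustrative):

```c
#include <assert.h>

/* Illustrative stand-ins; in the real tree these constants live in
 * networking.c and bio.c, which is exactly the layering problem discussed. */
#define IO_THREADS_MAX_NUM 128
#define BIO_THREAD_COUNT 3
#define MAX_THREADS_NUM 132

/* C11 static assertion: compilation fails if the per-thread counter array
 * could ever be smaller than the maximum number of registered threads. */
static_assert(IO_THREADS_MAX_NUM + BIO_THREAD_COUNT + 1 <= MAX_THREADS_NUM,
              "MAX_THREADS_NUM too small for all IO/BIO threads plus main");
```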

Member:

Btw, if we go this route, we should probably introduce a new header to host all thread-local-storage-related definitions. It feels a bit out of place to define MAX_THREADS_NUM in zmalloc.c, and we might have other consumers of this macro in the future. I wonder if we should call the new header system.h, which would sit at the lowest layer.

static unsigned long long used_memory_thread[MAX_THREADS_NUM];

static serverAtomic int total_active_threads;

static serverAtomic size_t used_memory = 0;
/* Register the thread index in start_routine. */
void zmalloc_register_thread_index(void) {
atomicGetIncr(total_active_threads, thread_index, 1);
Member:

Suggested change
atomicGetIncr(total_active_threads, thread_index, 1);
serverAssert(total_active_threads < MAX_THREADS_NUM);
atomicGetIncr(total_active_threads, thread_index, 1);

Member:

Since both total_active_threads and thread_index are of type size_t, whose size matches the native word size of the platform, the proposed change should work.

Strictly speaking though (if size_t were 64-bit on a 32-bit platform), the two suggested lines need to be flipped. We also shouldn't read the global var total_active_threads without an atomic operation; instead, we should check the thread-local var thread_index.

Suggested change
atomicGetIncr(total_active_threads, thread_index, 1);
atomicGetIncr(total_active_threads, thread_index, 1);
serverAssert(thread_index < MAX_THREADS_NUM);
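The ordering argued for above can be sketched with portable C11 atomics (a minimal model, not the PR's code: atomicGetIncr/serverAssert are replaced by atomic_fetch_add/assert, and the names are reused only for illustration):

```c
#include <assert.h>
#include <stdatomic.h>

#define MAX_THREADS_NUM 132

static _Atomic int total_active_threads = 0;
static _Thread_local int thread_index = -1;

/* Claim a slot first with an atomic fetch-add, then bounds-check the
 * thread-local result. Asserting on the global *before* the increment
 * would race with other threads registering concurrently, and would
 * touch the shared counter without an atomic operation. */
void register_thread_index(void) {
    if (thread_index != -1) return; /* already registered */
    thread_index = atomic_fetch_add(&total_active_threads, 1);
    assert(thread_index < MAX_THREADS_NUM);
}
```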

assert(total_active_threads < MAX_THREADS_NUM);
}

static void zmalloc_default_oom(size_t size) {
fprintf(stderr, "zmalloc: Out of memory trying to allocate %zu bytes\n",
@@ -409,8 +430,11 @@ char *zstrdup(const char *s) {
}

size_t zmalloc_used_memory(void) {
size_t um;
atomicGet(used_memory,um);
assert(total_active_threads < MAX_THREADS_NUM);
size_t um = 0;
for (int i = 0; i < total_active_threads; i++) {
@qlong (Apr 15, 2024):

Wondering how important it is for zmalloc_used_memory to return the exact used_memory. As we loop through to add up the memory used by each thread, the threads already counted may continue to allocate or free memory, so the sum might under- or over-count the actual memory used by all threads.

I don't think this is an issue if the function is only used for metrics, but overMaxmemoryAfterAlloc (evict.c) is not just for metrics; we should probably ensure (and call out) that undercounting memory usage would not be an issue there.
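The reader side being discussed can be modeled in a few lines (a simplified sketch of the PR's summation loop, with a small MAX_THREADS_NUM for illustration): each slot has a single writer, so an individual read is not torn on platforms where size_t loads are word-sized, but the loop as a whole is not a consistent cross-thread snapshot, which is exactly the under/over-counting concern raised here.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_THREADS_NUM 4

static size_t used_memory_thread[MAX_THREADS_NUM];
static int total_active_threads = 0;

/* Sums a snapshot of each per-thread counter. A thread counted early in
 * the loop may allocate or free again before the loop finishes, so the
 * result is approximate while writers are active, and exact only when
 * the writers are quiescent. */
size_t used_memory_sum(void) {
    size_t um = 0;
    for (int i = 0; i < total_active_threads; i++)
        um += used_memory_thread[i];
    return um;
}
```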

Contributor Author:

@qlong I understand your concern. Initially I wanted to keep an approximate value for used_memory: each thread keeps its own used_memory_thread, and only when used_memory_thread exceeds a threshold (e.g. 1 KB) does it update the global atomic used_memory; the pseudocode would look like the diff below. But the test case in https://github.com/valkey-io/valkey/blob/unstable/src/zmalloc.c#L917 seems to need an accurate used_memory. Back to this PR: each element in the used_memory_thread array has only one writer updating the value; a reader may see either the old or the new value, depending on the read timing.
@madolson What's your thought?

-#define update_zmalloc_stat_alloc(__n) atomicIncr(used_memory,(__n))
-#define update_zmalloc_stat_free(__n) atomicDecr(used_memory,(__n))
+#define update_zmalloc_stat_alloc(__n) do { \
+    used_memory_thread += (__n); \
+    if (used_memory_thread > 1024) atomicIncr(used_memory, used_memory_thread); \
+} while (0)
+
+#define update_zmalloc_stat_free(__n) do { \
+    used_memory_thread -= (__n); \
+    if (used_memory_thread > 1024) atomicDecr(used_memory, used_memory_thread); \
+} while (0)

 static serverAtomic size_t used_memory = 0;
+static __thread size_t used_memory_thread = 0;
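As written, the threshold-flush pseudocode above never resets the local accumulator after publishing, and the free path adds rather than subtracts. A corrected standalone sketch of the same idea (names follow the pseudocode; the 1024-byte threshold and a signed accumulator are assumptions for illustration) would look like:

```c
#include <stdatomic.h>
#include <stddef.h>

static _Atomic long long used_memory = 0;
static _Thread_local long long used_memory_local = 0;

#define FLUSH_THRESHOLD 1024

/* Publish-and-reset: flush once the pending local delta (in either
 * direction) exceeds the threshold, then zero the accumulator so the
 * same delta is not re-published on every subsequent call. */
static void flush_if_needed(void) {
    if (used_memory_local > FLUSH_THRESHOLD ||
        used_memory_local < -FLUSH_THRESHOLD) {
        atomic_fetch_add(&used_memory, used_memory_local);
        used_memory_local = 0;
    }
}

static void stat_alloc(size_t n) { used_memory_local += (long long)n; flush_if_needed(); }
static void stat_free(size_t n)  { used_memory_local -= (long long)n; flush_if_needed(); }
```

The global counter can then lag the true value by at most FLUSH_THRESHOLD bytes per thread, which is the staleness bound this trade-off buys.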

Member:

a few quick thoughts:

  1. I see value in removing atomic operations.
  2. Since we use size_t for the counters, which should be natively aligned, the worst we can get is staleness, as opposed to inconsistency from a torn read.
  3. Absolute accuracy doesn't have much value for metrics. The value goes stale immediately after the INFO command returns, whether it is absolutely accurate or not. As long as the value is close enough, I think we are good. We can change the tests.
  4. Agreed that overMaxmemoryAfterAlloc sets a higher bar than metrics. My gut feeling is that the staleness introduced by the relaxed memory access should still be negligible; I doubt we could accumulate many changes before the CPU flushes its cache to main memory. Curious to know if anyone has more definitive information.
  5. The absolute-accuracy requirement is directly at odds with the performance overhead of the atomic operations, so one of them has to go.

@lipzhu (Contributor Author, Apr 30, 2024):

Seems we all share a concern about the absolute accuracy of used_memory in overMaxmemoryAfterAlloc. IMO this is not a problem: by the definition of overMaxmemoryAfterAlloc, the comparison between moremem and used_memory (their semantics differ) is not absolutely accurate anyway, considering the gap between the requested memory and the memory actually allocated by the allocator.

int overMaxmemoryAfterAlloc(size_t moremem) {
    if (!server.maxmemory) return 0; /* No limit. */

    /* Check quickly. */
    size_t mem_used = zmalloc_used_memory();
    if (mem_used + moremem <= server.maxmemory) return 0;

    size_t overhead = freeMemoryGetNotCountedMemory();
    mem_used = (mem_used > overhead) ? mem_used - overhead : 0;
    return mem_used + moremem > server.maxmemory;
}

Contributor Author:

Kindly ping @PingXie. Any comments on this? Do we need to continue?

Member:

I think there is a way to address this concern while retaining most of the performance benefit, if we keep the existing io-threading model. The idea would be to keep a mirrored per-thread counter array such that each thread periodically "flushes" its thread-local counter value to the corresponding element in the global per-thread counter array using atomic operations. This would work with the current io-threading model, where io-threads and the main thread run in lock-step. Every io thread would need to push its local value to this global array when it is done with its work, but only once. Then by the time the main thread starts executing, it would have accurate data. We can reduce the number of atomic operations to one per io-thread per epoll session.

We would have to revisit this if we went for a different threading model (such as #22).

That said, I am still not sure how accurate this information needs to be. zmalloc_used_memory doesn't represent the RSS usage of the process anyway. I think it would be really helpful if we could somehow establish the worst possible case, empirically or even as a SWAG.
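The mirrored-counter idea above can be sketched as follows (a minimal model under stated assumptions: all names are hypothetical, thread registration is assumed to have happened already, and the real io-thread loop would call the flush at the end of each epoll session):

```c
#include <stdatomic.h>
#include <stddef.h>

#define MAX_THREADS_NUM 132

/* Global mirror: one atomically-updated slot per thread, zero-initialized. */
static _Atomic size_t used_memory_mirror[MAX_THREADS_NUM];

static _Thread_local size_t used_memory_local = 0;
static _Thread_local int thread_index = 0; /* assumed already registered */

/* Hot path: plain thread-local arithmetic, no atomics at all. */
static void stat_alloc(size_t n) { used_memory_local += n; }
static void stat_free(size_t n)  { used_memory_local -= n; }

/* Once per io-thread per epoll session: publish the local value with a
 * single relaxed atomic store, so the main thread sees accurate data by
 * the time it runs. */
void flush_thread_stats(void) {
    atomic_store_explicit(&used_memory_mirror[thread_index],
                          used_memory_local, memory_order_relaxed);
}

/* Main-thread reader: sums the published per-thread values. */
size_t used_memory_total(void) {
    size_t um = 0;
    for (int i = 0; i < MAX_THREADS_NUM; i++)
        um += atomic_load_explicit(&used_memory_mirror[i],
                                   memory_order_relaxed);
    return um;
}
```

This keeps the allocation hot path atomic-free while bounding the atomic traffic to one store per thread per session, which is the trade-off the comment describes.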

@lipzhu (Contributor Author, May 7, 2024):

That said, I am still not sure how accurate this information needs to be. zmalloc_used_memory doesn't represent the RSS usage of the process anyways.

IMO, an absolutely accurate zmalloc_used_memory is not meaningful, and the cost of achieving it is very high: not only the expensive atomic operation on every zmalloc/zfree, but also the call to je_malloc_usable_size (from the cycles hotspot snapshot in the top comment, je_malloc_usable_size is the #1 hotspot). If we can accept a slightly stale zmalloc_used_memory, ideally we could see at least a 10% perf gain.

@PingXie Do you think we need to open an issue to discuss with the rest of @valkey-io/core-team whether an absolutely accurate zmalloc_used_memory is necessary?

Member:

I saw you opened #467. Thanks @lipzhu.

um += used_memory_thread[i];
}
return um;
}

@@ -628,8 +652,6 @@ size_t zmalloc_get_rss(void) {

#if defined(USE_JEMALLOC)

#include "serverassert.h"

#define STRINGIFY_(x) #x
#define STRINGIFY(x) STRINGIFY_(x)

1 change: 1 addition & 0 deletions src/zmalloc.h
@@ -138,6 +138,7 @@ size_t zmalloc_get_memory_size(void);
void zlibc_free(void *ptr);
void zlibc_trim(void);
void zmadvise_dontneed(void *ptr);
void zmalloc_register_thread_index(void);

#ifdef HAVE_DEFRAG
void zfree_no_tcache(void *ptr);