Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement vblock split write amp reduction algo #582

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

tristan957
Copy link
Collaborator

During a vblock split, we can save write amp by using mpool_mblock_punch() to FALLOC_FL_PUNCH_HOLE a portion of the mblock for both the left and right hand side destination mblocks.

Signed-off-by: Tristan Partin tpartin@micron.com
Co-authored-by: Nabeel Meeramohideen Mohamed nmeeramohide@micron.com

perfc_rwc++;
if (perfc_ison(pc, PERFC_RA_CNCOMP_RBYTES) || perfc_ison(pc, PERFC_RA_CNCOMP_WBYTES)) {
struct mblock_props props;
log_debug("Cloned mblock (0x%" PRIx64 ") starting at offset %ld", src_mbid, off);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added these for my debugging purposes, but they could probably be removed or changed. Open to thoughts.

bool overlaps;
bool accessed;
uint16_t vblk_idx;
uint offset;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a uint because this is what I get back from the kvset_iter_next_vref API.

@tristan957
Copy link
Collaborator Author

tristan957 commented Jan 11, 2023

Workload I used for testing:

kmt -j64 -i2g -Rx -s1 -f %023lu -l977 -okeydist=0 /mnt/hse kvs1 kvs-oparams cn_dsplit_size=32

With this, I notice a 4.68% reduction in KVDB size at the end of the run.

@beckerg
Copy link
Contributor

beckerg commented Jan 11, 2023

You should also have seen no gc operations...

@tristan957
Copy link
Collaborator Author

Greg how could I double check that? Is there a log message I should look out for?

lib/cn/kvset_split.c Outdated Show resolved Hide resolved
@tristan957 tristan957 force-pushed the tristan957/nabeel-algo branch 5 times, most recently from 6625ce7 to 4d35f54 Compare January 19, 2023 05:19
@tristan957 tristan957 marked this pull request as draft January 19, 2023 18:33
@tristan957 tristan957 force-pushed the tristan957/nabeel-algo branch 2 times, most recently from b0e3e43 to 5d57b45 Compare January 19, 2023 21:46
Copy link
Contributor

@gsramdasi gsramdasi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks Good!
But I feel the vblocks_split() function warrants a comment at the top briefly explaining the algorithm.

@tristan957
Copy link
Collaborator Author

It doesn't work though 😕

@tristan957
Copy link
Collaborator Author

Ok, so here are the facts. If I am remembering correctly from when I was last touching this PR, you can either end up in 1 of 2 situations. One is a segfault due to a null pointer dereference, but I couldn't tell you what the backtrace is. The other situation is the following:

kmt: lib/cn/kvset_split.c:781: vblocks_split: Assertion `blks_right->bl_vtotal >= blks_right->bl_vused' failed.
Process 29455 stopped
* thread #19, name = 'hse_cn_io', stop reason = signal SIGABRT
    frame #0: 0x00007ffff7b92e5c libc.so.6`__pthread_kill_implementation + 268
libc.so.6`__pthread_kill_implementation:
->  0x7ffff7b92e5c <+268>: movl   %eax, %ebp
    0x7ffff7b92e5e <+270>: negl   %ebp
    0x7ffff7b92e60 <+272>: cmpl   $0xfffff000, %eax         ; imm = 0xFFFFF000
    0x7ffff7b92e65 <+277>: movl   $0x0, %eax
(lldb) bt
* thread #19, name = 'hse_cn_io', stop reason = signal SIGABRT
  * frame #0: 0x00007ffff7b92e5c libc.so.6`__pthread_kill_implementation + 268
    frame #1: 0x00007ffff7b42a76 libc.so.6`raise + 22
    frame #2: 0x00007ffff7b2c7fc libc.so.6`abort + 215
    frame #3: 0x00007ffff7b2c71b libc.so.6`__assert_fail_base.cold + 15
    frame #4: 0x00007ffff7b3b656 libc.so.6`__assert_fail + 70
    frame #5: 0x00000000005fb613 kmt`vblocks_split(ks=0x00007ff613fec4c0, split_key=0x00007eef78bc24f0, work=0x00007fff7bbb7670, pc=0x0000000000887b28, result=0x00007ef378019b60) at kvset_split.c:781:5
    frame #6: 0x00000000005fbf73 kmt`kvset_split(ks=0x00007ff613fec4c0, split_key=0x00007eef78bc24f0, pc=0x0000000000887b28, result=0x00007ef378019b60) at kvset_split.c:880:11
    frame #7: 0x00000000005fc09b kmt`kvset_split_worker(work=0x00007ef378019b20) at kvset_split.c:904:18
    frame #8: 0x00000000004b6d1c kmt`worker_thread(arg=0x000000000083b280) at workqueue.c:456:13
    frame #9: 0x00007ffff7b9112d libc.so.6`start_thread + 717
    frame #10: 0x00007ffff7c12bc0 libc.so.6`__clone3 + 48

My steps for reproducing:

$ ./build/cli/hse kvdb create /mnt/hse
$ ./build/cli/hse kvs create /mnt/hse kvs1
$ lldb -- ./build/tools/kmt -j64 -i2g -Rx -s1 -f%023lu -l997 -okeydist=0 /mnt/hse kvs1 kvdb-oparams cn_io_threads=1 csched_debug_mask=$((0x80)) kvs-oparams cn_split_size=19 cn_dsplit_size=19

Been a while since I tried this workflow with gdb since I kept hitting an internal gdb bug. I am using a debug build, and the run will usually fail at >= 3.10 space amp, so pretty far into the load.

@gsramdasi
Copy link
Contributor

And this is the backtrace in the other scenario:

Thread 19 "hse_cn_io" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff7bf77000 (LWP 382640)]
0x000000000057fd92 in mbset_get_udata (blk_num=0, self=0x0) at ../../../home/gramdasi/hse-master/hse/lib/cn/mbset.h:111
111         bool valid = blk_num < self->mbs_mblkc;
Missing separate debuginfos, use: dnf debuginfo-install cyrus-sasl-lib-2.1.27-14.fc35.x86_64 glibc-2.34-7.fc35.x86_64 keyutils-libs-1.6.1-3.fc35.x86_64 krb5-libs-1.19.2-2.fc35.x86_64 libbsd-0.10.0-8.fc35.x86_64 libcom_err-1.46.3-1.fc35.x86_64 libgcc-11.2.1-1.fc35.x86_64 libselinux-3.2-4.fc35.x86_64 libstdc++-11.2.1-1.fc35.x86_64 libxcrypt-4.4.26-4.fc35.x86_64 libzstd-1.5.0-2.fc35.x86_64 openssl-libs-1.1.1l-2.fc35.x86_64 pcre2-10.37-4.fc35.x86_64 snappy-1.1.9-3.fc35.x86_64 userspace-rcu-0.13.0-3.fc35.x86_64 zlib-1.2.11-30.fc35.x86_64
(gdb) bt
#0  0x000000000057fd92 in mbset_get_udata (blk_num=0, self=0x0) at ../../../home/gramdasi/hse-master/hse/lib/cn/mbset.h:111
#1  lvx2vbd (i=104, ks=0x7fffec25a940) at ../../../home/gramdasi/hse-master/hse/lib/cn/kvset.c:138
#2  kvset_get_nth_vblock_alen (ks=0x7fffec25a940, index=104) at ../../../home/gramdasi/hse-master/hse/lib/cn/kvset.c:1767
#3  0x00000000005fd516 in vblocks_split (ks=0x7fffec25a940, split_key=0x7ef049fbd570, work=0x7fff7bf326f0, pc=0x8bf358, result=0x7ef02c05e100) at ../../../home/gramdasi/hse-master/hse/lib/cn/kvset_split.c:750
#4  0x00000000005fe164 in kvset_split (ks=0x7fffec25a940, split_key=0x7ef049fbd570, pc=0x8bf358, result=0x7ef02c05e100) at ../../../home/gramdasi/hse-master/hse/lib/cn/kvset_split.c:880
#5  0x00000000005fe28c in kvset_split_worker (work=0x7ef02c05e0c0) at ../../../home/gramdasi/hse-master/hse/lib/cn/kvset_split.c:904
#6  0x00000000004b8b62 in worker_thread (arg=0x872b00) at ../../../home/gramdasi/hse-master/hse/lib/util/lib/workqueue.c:456
#7  0x00007ffff7be6b17 in start_thread () from /lib64/libc.so.6
#8  0x00007ffff7c6b6a0 in clone3 () from /lib64/libc.so.6

@tristan957
Copy link
Collaborator Author

I have some pretty spammy logs in this diff. So what you'll see when you encounter either the segfault or the assertion failure is something like the following:

left: punched id=0x30703400159 vgidx=20 vblkidx=256 alen=4096

The problematic portion being the 4096 alen, which is most likely completely wrong. Following ancestries of blocks like this you can see that pretty much the entire block is a hole due cloning at an offset and then punching. Right now, I am trying to understand if the min and max keys of the vblocks need to change after cloning/punching.

tristan957 and others added 3 commits February 13, 2023 17:07
During a vblock split, we can save write amp by using
mpool_mblock_punch() to FALLOC_FL_PUNCH_HOLE a portion of the mblock for
both the left and right hand side destination mblocks.

Signed-off-by: Tristan Partin <tpartin@micron.com>
Co-authored-by: Nabeel Meeramohideen Mohamed <nmeeramohide@micron.com>
@tristan957
Copy link
Collaborator Author

tristan957 commented Feb 14, 2023

hsettp column headers

T NODE IDX DGEN COMP KEYS TOMBS PTOMBS HWLEN KWLEN VWLEN VGARB HBLKS KBLKS VBLKS VGRPS RULE STATE...
tristan957/nabeel-algo t  193 834 1123    3   2.1g     0      0  20.4m  43.9g   2.2t  52.2m   834  1900 74390 12512      - - 1 kvs1 1.002
master                 t  193 957 1122    3   2.1g     0      0  23.5m  43.8g   2.3t 161.0g   957  2045 74803 12937      - - 1 kvs1 1.077

I believe last column is space amp (Greg and unstructured data 😄). Doing some quick math would indicate that vgarb had a 99.07% reduction with this patch using the aforementioned workload.

[0 0 09:44:10] [tpartin@campise] [hse] [tristan957/nabeel-algo=]
+ $ du /mnt/hse
2146631940      /mnt/hse/capacity
2146631944      /mnt/hse
[0 0 02:38:09] [tpartin@campise] [hse] [master=]
+ $ du /mnt/hse
2298648080      /mnt/hse/capacity
2298648084      /mnt/hse

@tristan957
Copy link
Collaborator Author

Looking at du output, something seems off with the reported stats vs actual.

@beckerg
Copy link
Contributor

beckerg commented Feb 14, 2023

yeah, i was having issues with that when running db-bench... what stats are you looking at? in general, i don't think our stats include the garbage, which is why i asked nabeel to create the vgarb stat..

sadly, our stats are reported in SI units, and du is reported in powers of two, does that make a difference?
n/m, you're looking at byte counts...

@tristan957
Copy link
Collaborator Author

Well more debugging to do...sigh

@tristan957
Copy link
Collaborator Author

tristan957 commented Mar 19, 2023

This PR has at least two issues that I think were documented in JIRA, but since we no longer have JIRA here we are :).

  1. mblocks should not be punched in vblocks_split(), but instead after the entire split operation has finished. If an mblock is punched and then some sort of failure occurs before the split has completed, the KVDB will be unable to open.
  2. The min and max keys of vblocks have to change. In the event a vblock gets punched/cloned, we change either the max or min key which references the vblock respectively. Since we don't change the bounds of the vblock, in another split operation, the same vblock may be inadvertently chosen to become a punch/clone candidate again because the bounds of the vblock incorrectly indicate that it overlaps the split key.

Issue 1 manifests itself if you abort at the end of the vblocks_split() function. Issue 2 is harder to notice, but you will fail because total bytes is <= used bytes of the vblock. You would also notice a vblock has an alen of a single page (4096).

I think @gsramdasi noticed one other issue similar to how issue 2 manifests even when a vblock was going through its first split. This might be an indicator of a third undiagnosed issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants