Optimizations in notify-one #12545
Conversation
Hi @casualwind! Thank you for your pull request and welcome to our community.

**Action Required:** In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

**Process:** In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
Hello, I have signed the CLA, and this is my first PR to RocksDB. Thank you for any review!
@ajkr Could you or your colleagues help review my PR? I think this kind of modification may help performance on servers with high core counts. If you have any questions, please let me know. Thank you.
Hi, thanks for the PR. The change makes sense to me. I wonder if you have results for the performance improvement of each step (1. parallelize SetState vs. 2. wake up non-STATE_LOCKED_WAITING writers first). I ran some benchmarks with fillrandom and saw that step 1 improves performance but step 2 does not show much improvement.
@cbi42 Thank you for your review! We tested the performance before (default --enable_pipelined_write=1), and step 2's impact on performance is indeed nearly zero. We put them together because in our previous tests (on an older main branch), with --enable_pipelined_write=0, step 2 + step 1 performed better than step 1 alone. We have now retested on the latest main branch, and step 2 has no impact on performance in any case. So currently we only need step 1 rather than step 1 + step 2.
@cbi42 We ran two commands. For command 1, stdev/average is nearly 25%. We wonder why the gap is so large?
I just realized that fillrandom[-X5] will reuse the same DB, so it's probably better to just run fillrandom without the [-X5].
@cbi42 Thank you for clearing up our doubts.
db/write_thread.cc
Outdated
// The minimum number to allow the group to use parallel caller mode.
// The number must be no lower than 3.
const size_t MinParallelSize = 5;
Consider setting a higher MinParallelSize to avoid performance regression. From the graph in the summary, it seems the optimization helps when there are close to 40 threads. With `./db_bench --benchmarks=fillrandom[-X1] --threads=40 --seed=1708494134896523 --duration=10 --disable_auto_compactions=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --statistics=1 --disable_wal=0 --enable_pipelined_write=1 --num=100000000 --batch_size=1` and logging write group size in statistics, I got the following write group size distribution:

P50 : 17.658324  P95 : 29.367381  P99 : 35.859741  P100 : 40.000000  COUNT : 650853  SUM : 10974960

Maybe we can set the threshold to around 20?
The threshold has been changed to 20. I will check the one-pager core-scaling score with the newest version.
Force-pushed from 55e2bf6 to af1fac6
LGTM. Could you update the PR summary with only step 1 and its benchmark, and add a change log under https://github.com/facebook/rocksdb/tree/main/unreleased_history/performance_improvements?
@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@casualwind has updated the pull request. You must reimport the pull request before landing.
@cbi42 We have updated the PR summary and added a change log.
Hi - for the PR summary, I still see two optimizations listed; it should only have the first one now?
@@ -0,0 +1,17 @@
# Parallelize SetState in LaunchParallelMemTableWriters |
Do you mind giving a shorter summary of the improvement, like "improved write throughput to memtable when there's a large number of concurrent writers and allow_concurrent_memtable_write=true"? You can add the PR number to the changelog so people can read about the implementation details there. See more examples in https://github.com/facebook/rocksdb/blob/main/HISTORY.md
@casualwind has updated the pull request. You must reimport the pull request before landing.
We found that for writers in STATE_LOCKED_WAITING, the notify-one function needs to be called, and the cost of calling this function is very high, especially when there are many writers that need to be awakened. So we parallelize this process. To wake up each writer to write its own memtable, the leader writer first wakes up (n^0.5 - 1) caller writers, and then those callers and the leader each wake up roughly n^0.5 writers to write to the memtable. This reduces the leader's SetState work from n-1 writers to about 2*(n^0.5) writers.

vcpu=160, benchmark=db_bench. The score is normalized:

| case name         | optimized/base |
|-------------------|----------------|
| fillrandom        | 182%           |
| fillseq           | 184%           |
| fillsync          | 136%           |
| overwrite         | 179%           |
| randomreplacekeys | 180%           |
| randomtransaction | 161%           |
| updaterandom      | 163%           |
| xorupdaterandom   | 165%           |
@cbi42 I misunderstood "update the PR summary" before; it has been revised. For the changelog, the shorter summary and the PR number have been added as well. Please check whether this modification is suitable. Thank you.
To wake up each writer to write its own memtable, the leader writer first wakes up the (n^0.5 - 1) caller writers, and then those callers and the leader each wake up roughly n^0.5 writers to write to the memtable. This reduces the leader's SetState work from n-1 writers to about 2*(n^0.5) writers.

---
vcpu=160, benchmark=db_bench
The score is normalized:

| case name         | optimized/base |
|-------------------|----------------|
| fillrandom        | 182%           |
| fillseq           | 184%           |
| fillsync          | 136%           |
| overwrite         | 179%           |
| randomreplacekeys | 180%           |
| randomtransaction | 161%           |
| updaterandom      | 163%           |
| xorupdaterandom   | 165%           |
I'd suggest removing this part since it is available in the PR summary. Otherwise LGTM.
@@ -0,0 +1,18 @@
# Improved write throughput to memtable when there's a large number of concurrent writers and allow_concurrent_memtable_write=true (#12545)
nit: # is not needed. We use unreleased_history/release.sh to generate the release note, which will just prepend * for this changelog. See rocksdb/unreleased_history/README.txt, lines 25 to 28 in c72ee45:

> The file should usually contain one line of markdown, and "* " is not
> required, as it will automatically be inserted later if not included at the
> start of the first line in the file. Extra newlines or missing trailing
> newlines will also be corrected.
// The stride is the same for each writer in write_group, so w will
// call the writers with the same number in write_group mod total size
size_t stride = static_cast<size_t>(sqrt(write_group->size));
Internal linter complaint: "Unqualified sqrt call may result in a call to the C library function with the argument implicitly converted to double, leading to worse performance. If you are not calling the C library function, consider qualifying the call with the namespace. Otherwise, use std::sqrt instead."
// the total number of leader SetState calls.
// Set the leader itself to STATE_PARALLEL_MEMTABLE_WRITER, and set
// (stride-1) writers to be STATE_PARALLEL_MEMTABLE_CALLER.
size_t stride = static_cast<size_t>(sqrt(group_size));
Same here.
We tested on an Ice Lake server (vcpu=160). The default configuration is allow_concurrent_memtable_write=1, thread number = active core count. With our optimizations, the improvement can reach up to 184% in the fillseq case. op/s is the performance indicator in db_bench, and the following are performance improvements in some db_bench cases.
With analysis, we found that although the process of writing the memtable is done in parallel, the process of waking up the writers is not: only one writer is responsible for sequentially waking up all the other writers. The following is our method to optimize this process.
Assume there are currently n threads in total. We parallelize SetState in LaunchParallelMemTableWriters. To wake up each writer to write its own memtable, the leader writer first wakes up (n^0.5 - 1) caller writers, and then those callers and the leader each wake up roughly n^0.5 writers to write to the memtable. This reduces the leader's SetState work from n-1 writers to about 2*(n^0.5) writers.
A reproduction script:

./db_bench --benchmarks="fillrandom" --threads ${number of all active vcpus} --seed 1708494134896523 --duration 60