Optimizations in notify-one #12545
Conversation
Hi @casualwind! Thank you for your pull request and welcome to our community.

**Action Required:** In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

**Process:** In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
Hello, I have signed the CLA, and this is my first PR to RocksDB. Thank you for any review!
@ajkr Could you or your colleagues help review my PR? I think this kind of modification may help performance on servers with high core counts. If you have any questions, please let me know. Thank you.
Hi, thanks for the PR. The change makes sense to me. I wonder if you have results for the performance improvement of each step (1. parallelize SetState vs. 2. wake up non-STATE_LOCKED_WAITING writers first). I ran some benchmarks with fillrandom and saw that step 1 improves performance but step 2 does not show much improvement.
@cbi42 Thank you for your review! We tested the performance before (default --enable_pipelined_write=1), and step 2's impact on performance is indeed nearly zero. We put them together because in our previous tests (on an older main branch), with --enable_pipelined_write=0, step 2 + step 1 performed better than step 1 alone. We have now retested on the latest main branch, and step 2 has no impact on performance in any case. So currently we only need step 1 rather than step 1 + step 2.
@cbi42 We ran two commands. For command 1, stdev/average is nearly 25%. We wonder why the gap is so large?
I just realized that fillrandom[-X5] will reuse the same DB, so it's probably better to just run fillrandom without the [-X5].
@cbi42 Thank you for clearing up our doubts.
db/write_thread.cc
Outdated
// The minimum number to allow the group to use parallel caller mode.
// The number must be no lower than 3.
const size_t MinParallelSize = 5;
Consider setting a higher MinParallelSize to avoid performance regression. From the graph in the summary, it seems the optimization helps when there are close to 40 threads. With `./db_bench --benchmarks=fillrandom[-X1] --threads=40 --seed=1708494134896523 --duration=10 --disable_auto_compactions=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --statistics=1 --disable_wal=0 --enable_pipelined_write=1 --num=100000000 --batch_size=1` and logging write group size in statistics, I got the following write group size distribution:

P50 : 17.658324  P95 : 29.367381  P99 : 35.859741  P100 : 40.000000  COUNT : 650853  SUM : 10974960

Maybe we can set the threshold to around 20?
The threshold has been changed to 20. I will check the one-pager core-scaling score with the newest version.
Force-pushed from 55e2bf6 to af1fac6
LGTM. Could you update the PR summary with only step 1 and its benchmark, and add a change log under https://github.com/facebook/rocksdb/tree/main/unreleased_history/performance_improvements?
@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@casualwind has updated the pull request. You must reimport the pull request before landing.
@cbi42 We have updated the PR summary and added a change log.
Hi - for the PR summary, I still see two optimizations listed; it should only have the first one now?
@@ -0,0 +1,17 @@
# Parallelize SetState in LaunchParallelMemTableWriters |
Do you mind giving a shorter summary of the improvement, like "improved write throughput to memtable when there's a large number of concurrent writers and allow_concurrent_memtable_write=true"? You can add the PR number to the changelog so people can read about the implementation details there. See more examples in https://github.com/facebook/rocksdb/blob/main/HISTORY.md
@casualwind has updated the pull request. You must reimport the pull request before landing.
We found that for writers in STATE_LOCKED_WAITING, the notify-one function needs to be called, and the cost of calling this function is very high, especially when there are many writers that need to be awakened. So we parallelize this process. To wake up each writer to write its own memtable, the leader writer first wakes up (n^0.5 - 1) caller writers, and then those callers and the leader each wake up roughly n^0.5 writers to write to the memtable. This reduces the leader's SetState work from n-1 writers to about 2*(n^0.5) writers.

vcpu=160, benchmark=db_bench. The score is normalized:

| case name         | optimized/base |
|-------------------|----------------|
| fillrandom        | 182%           |
| fillseq           | 184%           |
| fillsync          | 136%           |
| overwrite         | 179%           |
| randomreplacekeys | 180%           |
| randomtransaction | 161%           |
| updaterandom      | 163%           |
| xorupdaterandom   | 165%           |
@cbi42 I misunderstood "update the PR summary" before; it has been revised. For the changelog, the shorter summary and the PR number have been added as well. Please check whether this modification is suitable. Thank you.
To wake up each writer to write its own memtable, the leader writer first wakes up the (n^0.5 - 1) caller writers, and then those callers and the leader each wake up roughly n^0.5 writers to write to the memtable. This reduces the leader's SetState work from n-1 writers to about 2*(n^0.5) writers.

---
vcpu=160, benchmark=db_bench
The score is normalized:

| case name         | optimized/base |
|-------------------|----------------|
| fillrandom        | 182%           |
| fillseq           | 184%           |
| fillsync          | 136%           |
| overwrite         | 179%           |
| randomreplacekeys | 180%           |
| randomtransaction | 161%           |
| updaterandom      | 163%           |
| xorupdaterandom   | 165%           |
I'd suggest removing this part since it is available in the PR summary. Otherwise LGTM.
@@ -0,0 +1,18 @@
# Improved write throughput to memtable when there's a large number of concurrent writers and allow_concurrent_memtable_write=true (#12545)
nit: # is not needed. We use unreleased_history/release.sh to generate the release note, which will just prepend * for this changelog. See rocksdb/unreleased_history/README.txt, lines 25 to 28 in c72ee45:

> The file should usually contain one line of markdown, and "* " is not
> required, as it will automatically be inserted later if not included at the
> start of the first line in the file. Extra newlines or missing trailing
> newlines will also be corrected.
// The stride is the same for each writer in write_group, so w will
// call the writers with the same number in write_group mod total size
size_t stride = static_cast<size_t>(sqrt(write_group->size));
Internal linter complaint: "Unqualified sqrt call may result in a call to the C library function with the argument implicitly converted to double, leading to worse performance. If you are not calling the C library function, consider qualifying the call with the namespace. Otherwise, use std::sqrt instead."
// the total number of leader SetState calls.
// Set the leader itself to STATE_PARALLEL_MEMTABLE_WRITER, and set
// (stride-1) writers to be STATE_PARALLEL_MEMTABLE_CALLER.
size_t stride = static_cast<size_t>(sqrt(group_size));
Same here.
We tested on an Ice Lake server (vcpu=160). The default configuration is allow_concurrent_memtable_write=1, thread number = active core count. With our optimizations, the improvement can reach up to 184% in the fillseq case. op/s is the performance indicator in db_bench, and the following are performance improvements in some db_bench cases.
With analysis, we found that although the process of writing the memtable is done in parallel, the process of waking up the writers is not: only one writer is responsible for sequentially waking up all the other writers. The following is our method to optimize this process.
Assume there are currently n threads in total. We parallelize SetState in LaunchParallelMemTableWriters. To wake up each writer to write its own memtable, the leader writer first wakes up (n^0.5 - 1) caller writers, and then those callers and the leader each wake up roughly n^0.5 writers to write to the memtable. This reduces the leader's SetState work from n-1 writers to about 2*(n^0.5) writers.
A reproduction script:

./db_bench --benchmarks="fillrandom" --threads ${number of all active vcpus} --seed 1708494134896523 --duration 60