
utils: chunked_vector: optimize for trivially_copyable types #18609

Closed

Conversation

bhalevy
Member

@bhalevy bhalevy commented May 10, 2024

When the chunked_vector holds a trivially copyable type T, it's more efficient to use std::memcpy to copy or move items around than to call the copy or move constructors per item.

This change has no significant effect on perf-simple-query:

base: median 95954.90 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42312 insns/op,        0 errors)
post: median 95543.43 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42343 insns/op,        0 errors)
  • **Backport reason (please explain below if this patch should be backported or not)**
    optimization, no need to backport
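
To illustrate the point about trivially copyable types (a minimal sketch, not the actual patch; the free-standing copy_chunk below is a name made up for this example): a whole chunk can be copied with one std::memcpy instead of one copy-constructor call per element. std::uninitialized_copy_n typically performs a similar dispatch internally in mainstream standard library implementations.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <type_traits>

// Illustrative sketch only: copy one chunk of n elements from src into
// uninitialized dst storage.
template <typename T>
void copy_chunk(const T* src, T* dst, std::size_t n) {
    if constexpr (std::is_trivially_copyable_v<T>) {
        // One bulk copy for the whole chunk.
        std::memcpy(dst, src, n * sizeof(T));
    } else {
        // Fall back to per-element copy construction.
        std::uninitialized_copy_n(src, n, dst);
    }
}
```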

@bhalevy bhalevy added the backport/none Backport is not required label May 10, 2024
@bhalevy bhalevy requested a review from nyh May 10, 2024 07:34
@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Container Test
✅ - dtest
✅ - dtest with topology changes
✅ - Unit Tests

Build Details:

  • Duration: 2 hr 49 min
  • Builder: spider7.cloudius-systems.com

@bhalevy
Member Author

bhalevy commented May 11, 2024

@scylladb/scylla-maint please consider merging

@raphaelsc
Member

When the chunked_vector holds a trivially copyable type T, it's more efficient to use std::memcpy to copy or move items around than to call the copy or move constructors per item.

perf-simple-query results show slight improvement in instructions/ops:

base: median 97702.67 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42411 insns/op,        0 errors)
post: median 95048.61 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42291 insns/op,        0 errors)

why is tps lower?

@bhalevy
Member Author

bhalevy commented May 12, 2024

When the chunked_vector holds a trivially copyable type T, it's more efficient to use std::memcpy to copy or move items around than to call the copy or move constructors per item.
perf-simple-query results show slight improvement in instructions/ops:

base: median 97702.67 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42411 insns/op,        0 errors)
post: median 95048.61 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42291 insns/op,        0 errors)

why is tps lower?

I don't know. It's not very reliable since it depends on real time, i.e. on the CPU clock and possibly on other processes, so it varies a lot within each run and between runs.
Instructions per op is a much more reliable metric.

I submitted #18628 to change the default metric used for perf-simple-query to "instructions" per op, since in practice this is the one we look at to compare results across different test machines, runs, and versions. The throughput (tps) result is simply not reliable enough due to its dependence on environmental factors (machine performance, possible temperature effects, the impact of other processes, and generally high jitter).

@bhalevy bhalevy force-pushed the chunked_vector-optimize-trivial-types branch from 5db23bc to f7a15be Compare May 12, 2024 15:18
@bhalevy
Member Author

bhalevy commented May 12, 2024

v2 (f7a15be): use std::uninitialized_{copy,move} and std::destroy

```cpp
    }
    if (auto remainder = size % max_chunk_capacity()) {
        copy_chunk(i, remainder);
    }
```
Member

The if() is likely unneeded. But maybe it is needed - can _chunks be empty?

Member Author

Yes, _chunks can be empty if the chunked_vector is empty.
Also, _size might be aligned on max_chunk_capacity() and then there is no remainder.
We can turn this into an iteration over _chunks.size(), calculating the size of each chunk per iteration.
It might be a bit more efficient for single-chunk vectors.

Member

Then copy_chunk(0) will be illegal.

Member Author

That's why I had the first if (!size) at the beginning of the function.
In any case, godbolt.org indicates that a single loop produces slightly more condensed machine code for x86_64, so I simplified the function in v3 by having a single loop and calculating the count per chunk in the loop body using std::min(size, max_chunk_size()).
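
For context, the v3 single-loop shape described above would look roughly like the following hedged sketch (copy_chunk and max_chunk_capacity stand in for the members shown in the diff; this is not the exact committed code):

```cpp
#include <algorithm>
#include <cstddef>

// Hedged sketch of the v3 approach: one loop iteration per chunk, with the
// per-chunk element count computed in the loop body via std::min. The empty
// vector (size == 0) and exact-multiple cases need no special-casing.
template <typename CopyChunk>
void copy_all_chunks(std::size_t size, std::size_t max_chunk_capacity, CopyChunk copy_chunk) {
    for (std::size_t i = 0; size; ++i) {
        auto count = std::min(size, max_chunk_capacity);
        copy_chunk(i, count);   // copy `count` elements into chunk i
        size -= count;
    }
}
```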

Member

Yes, the new loop looks much better.

Member

But: it's not exception safe. You must update _size after every uninitialized_copy_n() succeeds, so the destructor can tear it down if the next one fails (it will destroy the partial work that it does, but not the work that previous iterations did).

Member

No, you need a try/catch, since the destructor won't be called on a not-fully-constructed object.

Member Author

good point, will fix.

Member Author

Actually, @avikivity, the copy constructor delegates to the default constructor (which is implicitly noexcept).
So the destructor will be called if an exception is thrown in the copy constructor's function body, since the object is already considered fully constructed at that point.
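
A minimal standalone illustration of this language rule (hypothetical widget type, unrelated to chunked_vector): once a delegating constructor's target completes, the object counts as fully constructed, so its destructor does run if the delegating constructor's body later throws.

```cpp
#include <cstdio>
#include <stdexcept>

struct widget {
    widget() noexcept {}                    // delegation target; object is "constructed" after this
    explicit widget(int) : widget() {       // delegating constructor
        throw std::runtime_error("boom");   // thrown from the body, after delegation completed
    }
    ~widget() { std::puts("~widget ran"); } // invoked during unwinding, so cleanup can happen here
};

int main() {
    try {
        widget w(42);
    } catch (const std::exception&) {
        // "~widget ran" was printed before control reached this handler.
    }
}
```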

Member Author

In the process, I found that the fill constructor isn't safe. Submitted #18636 (separately, to consider backporting).

@bhalevy bhalevy force-pushed the chunked_vector-optimize-trivial-types branch from f7a15be to 31ff25a Compare May 12, 2024 19:33
@bhalevy
Member Author

bhalevy commented May 12, 2024

In v3 (31ff25a):

  • removed extraneous #include
  • simplified copy logic.

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Container Test
✅ - dtest with topology changes
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 5 hr 59 min
  • Builder: i-0279a9a1ff7985ba8 (m5ad.8xlarge)

@bhalevy bhalevy force-pushed the chunked_vector-optimize-trivial-types branch from 31ff25a to 2dc42c6 Compare May 13, 2024 06:29
@bhalevy
Member Author

bhalevy commented May 13, 2024

In v4 (2dc42c6):

  • rebased
  • incrementally update _size in the copy ctor to facilitate destruction of a partially copied object on exception (see the sketch below)
  • simplified the destruction loop
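
Putting the exception-safety points together, a hedged sketch of the pattern (assumed member names _chunks, _size, max_chunk_capacity; not the actual Scylla chunked_vector): the copy constructor delegates to the noexcept default constructor and bumps _size only after each chunk's elements are fully constructed, so the destructor destroys exactly the constructed prefix if a later chunk's copy throws.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <new>
#include <vector>

// Hedged sketch (not the actual Scylla code) of an exception-safe chunked copy:
// _size is kept in step with construction, and delegation to the default ctor
// guarantees the destructor runs if the copy ctor's body throws.
template <typename T>
struct sketch_vector {
    static constexpr std::size_t max_chunk_capacity = 512;
    std::vector<T*> _chunks;
    std::size_t _size = 0;

    sketch_vector() = default;  // noexcept delegation target

    sketch_vector(const sketch_vector& other) : sketch_vector() {
        _chunks.reserve((other._size + max_chunk_capacity - 1) / max_chunk_capacity);
        std::size_t left = other._size;
        for (std::size_t i = 0; left; ++i) {
            T* chunk = static_cast<T*>(::operator new(sizeof(T) * max_chunk_capacity));
            _chunks.push_back(chunk);  // track the raw chunk before constructing into it
            auto count = std::min(left, max_chunk_capacity);
            // If this throws it destroys only its own partial work; _size still
            // covers only fully-copied chunks, and ~sketch_vector() runs because
            // *this already counts as constructed (delegating constructor).
            std::uninitialized_copy_n(other._chunks[i], count, chunk);
            _size += count;
            left -= count;
        }
    }

    ~sketch_vector() {
        std::size_t left = _size;
        for (T* chunk : _chunks) {
            auto count = std::min(left, max_chunk_capacity);
            std::destroy_n(chunk, count);  // no-op for trivially destructible T
            left -= count;
            ::operator delete(chunk);
        }
    }

    sketch_vector(sketch_vector&&) = delete;                  // keep the sketch minimal
    sketch_vector& operator=(const sketch_vector&) = delete;
};
```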

@bhalevy bhalevy marked this pull request as draft May 13, 2024 09:57
@bhalevy
Member Author

bhalevy commented May 13, 2024

Draft until #18636

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Container Test
✅ - dtest with topology changes
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 5 hr 54 min
  • Builder: i-04e80069e8e21a6b5 (m5ad.8xlarge)

@bhalevy bhalevy force-pushed the chunked_vector-optimize-trivial-types branch from 2dc42c6 to 1c0ff86 Compare May 13, 2024 15:09
Use std::uninitialized_{copy,move} and std::destroy
that have optimizations for trivially copyable and
trivially moveable types.
In those cases, memory can be copied onto the uninitialized
memory, rather than invoking the respective copy/move constructors,
one item at a time.

perf-simple-query results:
```
base: median 95954.90 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42312 insns/op,        0 errors)
post: median 97530.65 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42331 insns/op,        0 errors)
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
@bhalevy bhalevy force-pushed the chunked_vector-optimize-trivial-types branch from 1c0ff86 to 744c794 Compare May 15, 2024 07:40
@bhalevy bhalevy marked this pull request as ready for review May 15, 2024 07:42
@bhalevy
Member Author

bhalevy commented May 15, 2024

In v5 (744c794):

  • rebased
  • also converted the fill ctor to work per-chunk and use std::uninitialized_fill_n (see the sketch below)
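
A short hedged sketch of the per-chunk fill (free-function form with assumed parameters; not the committed code) using std::uninitialized_fill_n, following the same keep-_size-in-step idea as the copy-constructor sketch earlier:

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

// Hedged sketch: fill `count` elements across pre-allocated chunks, bumping
// `size` after every chunk so an exception midway leaves a destructible state.
template <typename T>
void fill_chunks(std::vector<T*>& chunks, std::size_t& size,
                 std::size_t count, const T& value, std::size_t max_chunk_capacity) {
    for (std::size_t i = 0; count; ++i) {
        auto n = std::min(count, max_chunk_capacity);
        std::uninitialized_fill_n(chunks[i], n, value);  // constructs n copies of value
        size += n;
        count -= n;
    }
}
```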

@nyh
Contributor

nyh commented May 15, 2024

When the chunked_vector holds a trivially copyable type T, it's more efficient to use std::memcpy to copy or move items around than to call the copy or move constructors per item.

This change has no significant effect on perf-simple-query:

base: median 95954.90 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42312 insns/op,        0 errors)
post: median 95543.43 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42343 insns/op,        0 errors)
* [x]  ** Backport reason (please explain below if this patch should be backported or not) **
  optimization, no need to backport

If this "optimization" doesn't optimize that use case, I wonder what was your original motivation for doing it. Was it something that you thought of while reviewing the code? Do you have a more specific use case in mind which did get optimized by this change? Did you try any sort of unit-test micro-benchmark that got improved by your change?

Contributor

@nyh nyh left a comment

I looked at the algorithms carefully, and they look correct, and I'm beginning to understand the motivation. In Seastar's chunked_fifo.hh I also had a sophisticated destructor, but then realized that it isn't common to destroy a full queue (one normally consumes the queue rather than deleting it while it's very full), so I ifdef'ed out all the sophistication and left a simple loop implementation.
But I agree that chunked_vector isn't just a queue - it often holds data that is read without popping - and then the entire vector is destroyed while still full. So this optimization is worthwhile.

@nyh
Contributor

nyh commented May 15, 2024

@avikivity can you please leave your final stamp of approval? You had a lot of comments and requests previously.

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Container Test
✅ - dtest with topology changes
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 4 hr 41 min
  • Builder: i-085d1118792ec5a75 (m5ad.12xlarge)

@bhalevy
Member Author

bhalevy commented May 15, 2024

When the chunked_vector holds a trivially copyable type T, it's more efficient to use std::memcpy to copy or move items around than to call the copy or move constructors per item.
This change has no significant effect on perf-simple-query:

base: median 95954.90 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42312 insns/op,        0 errors)
post: median 95543.43 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42343 insns/op,        0 errors)
* [x]  ** Backport reason (please explain below if this patch should be backported or not) **
  optimization, no need to backport

If this "optimization" doesn't optimize that use case, I wonder what was your original motivation for doing it. Was it something that you thought of while reviewing the code?

Yes, we should prefer bulk, contiguous operations that can work sequentially on each chunk over per-item operations that need to calculate the address of each item. Plus, the bulk operations are optimized for trivial types, which we don't currently happen to use on the hot data path.

Do you have a more specific use case in mind which did get optimized by this change? Did you try any sort of unit-test micro-benchmark that got improved by your change?

#18545 will increase the usage of chunked_vector (with a non-trivial type) on the hot path, so I expect to see some benefits there (but not in perf-simple-query, since it probably won't generate long enough partition range vectors; maybe we need a separate benchmark for that).

Labels: backport/none (Backport is not required), promoted-to-master

6 participants