utils: chunked_vector: optimize for trivially_copyable types #18609
Conversation
🟢 CI State: SUCCESS ✅
@scylladb/scylla-maint please consider merging
why is tps lower?
I don't know. It's not very reliable since it's affected by real time, i.e. by the cpu clock and possibly by other processes, so it varies a lot in each run and between runs. I submitted #18628 to change the default metric used for perf-simple-query to "instructions" per op, since practically, this is the one we're looking at to compare results on different test machines and across different runs and versions. The throughput (tps) result is simply not reliable enough due to its dependence on environmental factors (machine performance, possible temperature effects, other processes' impact, and generally high jitter).
Force-pushed from 5db23bc to f7a15be
v2 (f7a15be): use
```
}
if (auto remainder = size % max_chunk_capacity()) {
    copy_chunk(i, remainder);
}
```
The if() is likely unneeded. But maybe it is needed: can _chunks be empty?
yes, _chunks can be empty if the chunked_vector is empty.
Also _size might be aligned on max_chunk_capacity()
and then there is no remainder.
We can turn this into an iteration over _chunks.size(), calculating the size of each chunk in each iteration.
It might be a bit more efficient for single-chunk vectors.
Then copy_chunk(0) will be illegal.
That's why I had the first if (!size)
at the beginning of the function.
In any case, godbolt.org indicates that a single loop produces slightly more condensed machine code for x86_64, so I simplified the function in v3 by having a single loop and calculating the count per chunk in the loop body using std::min(size, max_chunk_size()).
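The single-loop shape discussed above can be sketched as follows. This is an invented illustration, not the actual chunked_vector code: `copy_chunks`, the free-standing `max_chunk_capacity()`, and the tiny chunk capacity are all hypothetical, chosen to show how `std::min` folds the remainder handling into one loop with no separate `if` and no illegal `copy_chunk(0)` for an empty source.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical tiny chunk capacity, for illustration only.
constexpr std::size_t max_chunk_capacity() { return 4; }

// Splits src into chunks of at most max_chunk_capacity() elements.
// One loop handles full chunks and the remainder alike: each iteration
// copies std::min(size, max_chunk_capacity()) elements, so an empty
// source never enters the loop and no remainder branch is needed.
std::vector<std::vector<int>> copy_chunks(const std::vector<int>& src) {
    std::vector<std::vector<int>> chunks;
    std::size_t size = src.size();
    for (std::size_t i = 0; size; ++i) {
        std::size_t count = std::min(size, max_chunk_capacity());
        auto first = src.begin() + i * max_chunk_capacity();
        chunks.emplace_back(first, first + count);
        size -= count;
    }
    return chunks;
}
```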
Yes, the new loop looks much better.
But: it's not exception safe. You must update _size after every uninitialized_copy_n() succeeds, so the destructor can tear it down if the next one fails (uninitialized_copy_n() will destroy its own partial work, but not the work that previous iterations did).
No, you need a try/catch, since the destructor won't be called on a not-fully-constructed object.
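The hazard can be reproduced with a toy type whose copy constructor throws on demand. This is a hedged sketch, not the chunked_vector code: `Thrower` and `copy_two_chunks` are invented for illustration. If the second chunk's copy throws mid-construction, only an explicit try/catch can destroy the already-completed first chunk, because the destructor of a not-fully-constructed containing object never runs.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>

struct Thrower {
    static inline int live = 0;        // count of constructed objects
    static inline int budget = -1;     // copies allowed before throwing; -1 = unlimited
    int v = 0;
    Thrower() { ++live; }
    Thrower(const Thrower& o) : v(o.v) {
        if (budget == 0) throw std::runtime_error("copy failed");
        if (budget > 0) --budget;
        ++live;
    }
    ~Thrower() { --live; }
};

// Copy n items into each of two raw (uninitialized) chunks. If copying
// chunk 1 throws, uninitialized_copy_n rolls back its own partial chunk,
// but chunk 0 must be destroyed by the catch block before rethrowing.
void copy_two_chunks(const Thrower* src, std::size_t n, Thrower* c0, Thrower* c1) {
    std::uninitialized_copy_n(src, n, c0);
    try {
        std::uninitialized_copy_n(src, n, c1);
    } catch (...) {
        std::destroy_n(c0, n);   // without this, chunk 0 leaks constructed objects
        throw;
    }
}
```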
good point, will fix.
Actually, @avikivity, the copy constructor calls the default constructor (which is implicitly noexcept).
So the destructor will be called if an exception is thrown in the copy constructor's body, since the object is already considered fully constructed.
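This property of delegating constructors can be checked with a minimal invented class (`Widget` is illustrative, not project code): once the delegated-to default constructor completes, the object counts as fully constructed, so a throw from the delegating constructor's body does run the destructor.

```cpp
#include <stdexcept>

struct Widget {
    static inline int destroyed = 0;
    Widget() noexcept {}                  // default ctor completes first
    Widget(const Widget&) : Widget() {    // copy ctor delegates, then its body throws
        throw std::runtime_error("copy body failed");
    }
    ~Widget() { ++destroyed; }
};
```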
In the process, I've found out that the fill constructor isn't safe. Submitted #18636 (separately, to consider backport)
Force-pushed from f7a15be to 31ff25a
In v3 (31ff25a):
🟢 CI State: SUCCESS ✅
Force-pushed from 31ff25a to 2dc42c6
In v4 (2dc42c6):
Draft until #18636
🟢 CI State: SUCCESS ✅
Force-pushed from 2dc42c6 to 1c0ff86
Use std::uninitialized_{copy,move} and std::destroy, which have optimizations for trivially copyable and trivially moveable types. In those cases, memory can be copied onto the uninitialized memory rather than invoking the respective copy/move constructors one item at a time.

perf-simple-query results:
```
base: median 95954.90 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 42312 insns/op, 0 errors)
post: median 97530.65 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 42331 insns/op, 0 errors)
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
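The reason the bulk functions can be faster, sketched with an invented type (`Point` and `bulk_copy` are illustrative only): for a trivially copyable T, the standard permits byte-wise copies into uninitialized storage, which is why implementations of std::uninitialized_copy_n can lower to a single memcpy per chunk instead of per-element constructor calls.

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>

struct Point { int x, y; };   // hypothetical trivially copyable element type
static_assert(std::is_trivially_copyable_v<Point>);

// Copies count Points into raw (uninitialized) storage with one memcpy.
// For trivially copyable types this is equivalent to what an optimized
// std::uninitialized_copy_n does, avoiding a per-item constructor call.
void bulk_copy(const Point* src, std::size_t count, Point* raw_dst) {
    std::memcpy(raw_dst, src, count * sizeof(Point));
}
```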
Force-pushed from 1c0ff86 to 744c794
In v5 (744c794):
If this "optimization" doesn't optimize that use case, I wonder what was your original motivation for doing it. Was it something that you thought of while reviewing the code? Do you have a more specific use case in mind which did get optimized by this change? Did you try any sort of unit-test micro-benchmark that got improved by your change?
I looked at the algorithms carefully, and they look correct, and I'm beginning to understand the motivation. In Seastar's chunked_fifo.hh I also had a sophisticated destructor, but then realized that it isn't common to destroy a full queue (one normally consumes the queue rather than deleting it while it's very full), so I ifdefed out all the sophistication and left a simple loop implementation.
But I agree that chunked_vector isn't just about a queue - often it holds data that is read without popping - and then the entire vector is destroyed while still full. So this optimization is worthwhile.
@avikivity can you please leave your final stamp of approval? You had a lot of comments and requests previously.
🟢 CI State: SUCCESS ✅
Yes, we should prefer bulk, contiguous operations that work sequentially on each chunk over per-item operations that need to calculate the address of each item. Plus, the bulk operations are optimized for trivial types, which we don't currently happen to use on the hot data path.
#18545 will increase the usage of chunked_vector (with a non-trivial type) on the hot path, so I expect to see some benefit there (but not in perf-simple-query, since it probably won't generate long enough partition range vectors; maybe we need a separate benchmark for that).
When the chunked_vector holds a trivially copyable type T, it's more efficient to use std::memcpy to copy or move items around than to invoke the copy or move constructor per item.
This change has no significant effect on perf-simple-query:
optimization, no need to backport