
utils: chunked_vector: optimize for trivially_copyable types #18609

Closed

Conversation

bhalevy
Member

@bhalevy bhalevy commented May 10, 2024

When the chunked_vector holds a trivially copyable type T, it's more efficient to use std::memcpy to copy or move items around than to call the copy or move constructors per item.

This change has no significant effect on perf-simple-query:

base: median 95954.90 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42312 insns/op,        0 errors)
post: median 95543.43 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42343 insns/op,        0 errors)
  • **Backport reason (please explain below if this patch should be backported or not)**
    optimization, no need to backport
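
To illustrate the point about trivially copyable types (a minimal sketch, not the actual patch; the free-standing copy_chunk below is a name made up for this example): a whole chunk can be copied with one std::memcpy instead of one copy-constructor call per element. std::uninitialized_copy_n typically performs a similar dispatch internally in mainstream standard library implementations.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <type_traits>

// Illustrative sketch only: copy one chunk of n elements from src into
// uninitialized dst storage.
template <typename T>
void copy_chunk(const T* src, T* dst, std::size_t n) {
    if constexpr (std::is_trivially_copyable_v<T>) {
        // One bulk copy for the whole chunk.
        std::memcpy(dst, src, n * sizeof(T));
    } else {
        // Fall back to per-element copy construction.
        std::uninitialized_copy_n(src, n, dst);
    }
}
```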

@bhalevy bhalevy added the backport/none Backport is not required label May 10, 2024
@bhalevy bhalevy requested a review from nyh May 10, 2024 07:34
@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Container Test
✅ - dtest
✅ - dtest with topology changes
✅ - Unit Tests

Build Details:

  • Duration: 2 hr 49 min
  • Builder: spider7.cloudius-systems.com

@bhalevy
Member Author

bhalevy commented May 11, 2024

@scylladb/scylla-maint please consider merging

@raphaelsc
Member

When the chunked_vector holds a trivially copyable type T, it's more efficient to use std::memcpy to copy or move items around than to call the copy or move constructors per item.

perf-simple-query results show slight improvement in instructions/ops:

base: median 97702.67 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42411 insns/op,        0 errors)
post: median 95048.61 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42291 insns/op,        0 errors)

why is tps lower?

@bhalevy
Member Author

bhalevy commented May 12, 2024

When the chunked_vector holds a trivially copyable type T, it's more efficient to use std::memcpy to copy or move items around than to call the copy or move constructors per item.
perf-simple-query results show slight improvement in instructions/ops:

base: median 97702.67 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42411 insns/op,        0 errors)
post: median 95048.61 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42291 insns/op,        0 errors)

why is tps lower?

I don't know. It's not very reliable since it depends on real time, i.e. on the CPU clock and possibly on other processes, so it varies a lot within each run and between runs.
Instructions per op is a much more reliable metric.

I submitted #18628 to change the default metric used for perf-simple-query to "instructions" per op, since in practice this is the one we look at to compare results across different test machines, runs, and versions. The throughput (tps) result is simply not reliable enough due to its dependence on environmental factors (machine performance, possible temperature effects, the impact of other processes, and generally high jitter).

@bhalevy bhalevy force-pushed the chunked_vector-optimize-trivial-types branch from 5db23bc to f7a15be Compare May 12, 2024 15:18
@bhalevy
Member Author

bhalevy commented May 12, 2024

v2 (f7a15be): use std::uninitialized_{copy,move} and std::destroy

```cpp
    }
    if (auto remainder = size % max_chunk_capacity()) {
        copy_chunk(i, remainder);
    }
```
Member

The if() is likely unneeded. But maybe it is needed - can _chunks be empty?

Member Author

Yes, _chunks can be empty if the chunked_vector is empty.
Also, _size might be aligned on max_chunk_capacity() and then there is no remainder.
We can turn this into an iteration over _chunks.size(), calculating the size of each chunk per iteration.
It might be a bit more efficient for single-chunk vectors.

Member

Then copy_chunk(0) will be illegal.

Member Author

That's why I had the first if (!size) at the beginning of the function.
In any case, godbolt.org indicates that a single loop produces slightly more condensed machine code for x86_64, so I simplified the function in v3 by having a single loop and calculating the count per chunk in the loop body using std::min(size, max_chunk_size()).
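
For context, the v3 single-loop shape described above would look roughly like the following hedged sketch (copy_chunk and max_chunk_capacity stand in for the members shown in the diff; this is not the exact committed code):

```cpp
#include <algorithm>
#include <cstddef>

// Hedged sketch of the v3 approach: one loop iteration per chunk, with the
// per-chunk element count computed in the loop body via std::min. The empty
// vector (size == 0) and exact-multiple cases need no special-casing.
template <typename CopyChunk>
void copy_all_chunks(std::size_t size, std::size_t max_chunk_capacity, CopyChunk copy_chunk) {
    for (std::size_t i = 0; size; ++i) {
        auto count = std::min(size, max_chunk_capacity);
        copy_chunk(i, count);   // copy `count` elements into chunk i
        size -= count;
    }
}
```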

Member

Yes, the new loop looks much better.

Member

But: it's not exception safe. You must update _size after every uninitialized_copy_n() succeeds, so the destructor can tear it down if the next one fails (it will destroy the partial work that it does, but not the work that previous iterations did).

Member

No, you need a try/catch, since the destructor won't be called on a not-fully-constructed object.

Member Author

good point, will fix.

Member Author

Actually, @avikivity, the copy constructor delegates to the default constructor (which is implicitly noexcept).
So the destructor will be called if an exception is thrown in the copy constructor's function body, since the object is already considered fully constructed at that point.
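
A minimal standalone illustration of this language rule (hypothetical widget type, unrelated to chunked_vector): once a delegating constructor's target completes, the object counts as fully constructed, so its destructor does run if the delegating constructor's body later throws.

```cpp
#include <cstdio>
#include <stdexcept>

struct widget {
    widget() noexcept {}                    // delegation target; object is "constructed" after this
    explicit widget(int) : widget() {       // delegating constructor
        throw std::runtime_error("boom");   // thrown from the body, after delegation completed
    }
    ~widget() { std::puts("~widget ran"); } // invoked during unwinding, so cleanup can happen here
};

int main() {
    try {
        widget w(42);
    } catch (const std::exception&) {
        // "~widget ran" was printed before control reached this handler.
    }
}
```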

Member Author

In the process, I found that the fill constructor isn't safe. Submitted #18636 (separately, to consider backporting).

@bhalevy bhalevy force-pushed the chunked_vector-optimize-trivial-types branch from f7a15be to 31ff25a Compare May 12, 2024 19:33
@bhalevy
Member Author

bhalevy commented May 12, 2024

In v3 (31ff25a):

  • removed extraneous #include
  • simplified copy logic.

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Container Test
✅ - dtest with topology changes
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 5 hr 59 min
  • Builder: i-0279a9a1ff7985ba8 (m5ad.8xlarge)

@bhalevy bhalevy force-pushed the chunked_vector-optimize-trivial-types branch from 31ff25a to 2dc42c6 Compare May 13, 2024 06:29
@bhalevy
Member Author

bhalevy commented May 13, 2024

In v4 (2dc42c6):

  • rebased
  • incrementally update _size in the copy ctor to facilitate destruction of a partially copied object on exception (see the sketch below)
  • simplified the destruction loop
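
Putting the exception-safety points together, a hedged sketch of the pattern (assumed member names _chunks, _size, max_chunk_capacity; not the actual Scylla chunked_vector): the copy constructor delegates to the noexcept default constructor and bumps _size only after each chunk's elements are fully constructed, so the destructor destroys exactly the constructed prefix if a later chunk's copy throws.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <new>
#include <vector>

// Hedged sketch (not the actual Scylla code) of an exception-safe chunked copy:
// _size is kept in step with construction, and delegation to the default ctor
// guarantees the destructor runs if the copy ctor's body throws.
template <typename T>
struct sketch_vector {
    static constexpr std::size_t max_chunk_capacity = 512;
    std::vector<T*> _chunks;
    std::size_t _size = 0;

    sketch_vector() = default;  // noexcept delegation target

    sketch_vector(const sketch_vector& other) : sketch_vector() {
        _chunks.reserve((other._size + max_chunk_capacity - 1) / max_chunk_capacity);
        std::size_t left = other._size;
        for (std::size_t i = 0; left; ++i) {
            T* chunk = static_cast<T*>(::operator new(sizeof(T) * max_chunk_capacity));
            _chunks.push_back(chunk);  // track the raw chunk before constructing into it
            auto count = std::min(left, max_chunk_capacity);
            // If this throws it destroys only its own partial work; _size still
            // covers only fully-copied chunks, and ~sketch_vector() runs because
            // *this already counts as constructed (delegating constructor).
            std::uninitialized_copy_n(other._chunks[i], count, chunk);
            _size += count;
            left -= count;
        }
    }

    ~sketch_vector() {
        std::size_t left = _size;
        for (T* chunk : _chunks) {
            auto count = std::min(left, max_chunk_capacity);
            std::destroy_n(chunk, count);  // no-op for trivially destructible T
            left -= count;
            ::operator delete(chunk);
        }
    }

    sketch_vector(sketch_vector&&) = delete;                  // keep the sketch minimal
    sketch_vector& operator=(const sketch_vector&) = delete;
};
```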

@bhalevy bhalevy marked this pull request as draft May 13, 2024 09:57
@bhalevy
Member Author

bhalevy commented May 13, 2024

Draft until #18636

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Container Test
✅ - dtest with topology changes
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 5 hr 54 min
  • Builder: i-04e80069e8e21a6b5 (m5ad.8xlarge)

@bhalevy bhalevy force-pushed the chunked_vector-optimize-trivial-types branch from 2dc42c6 to 1c0ff86 Compare May 13, 2024 15:09
Use std::uninitialized_{copy,move} and std::destroy
that have optimizations for trivially copyable and
trivially moveable types.
In those cases, memory can be copied onto the uninitialized
memory, rather than invoking the respective copy/move constructors,
one item at a time.

perf-simple-query results:
```
base: median 95954.90 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42312 insns/op,        0 errors)
post: median 97530.65 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42331 insns/op,        0 errors)
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
@bhalevy bhalevy force-pushed the chunked_vector-optimize-trivial-types branch from 1c0ff86 to 744c794 Compare May 15, 2024 07:40
@bhalevy bhalevy marked this pull request as ready for review May 15, 2024 07:42
@bhalevy
Member Author

bhalevy commented May 15, 2024

In v5 (744c794):

  • rebased
  • also converted the fill ctor to work per-chunk and use std::uninitialized_fill_n (see the sketch below)
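
A short hedged sketch of the per-chunk fill (free-function form with assumed parameters; not the committed code) using std::uninitialized_fill_n, following the same keep-_size-in-step idea as the copy-constructor sketch earlier:

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

// Hedged sketch: fill `count` elements across pre-allocated chunks, bumping
// `size` after every chunk so an exception midway leaves a destructible state.
template <typename T>
void fill_chunks(std::vector<T*>& chunks, std::size_t& size,
                 std::size_t count, const T& value, std::size_t max_chunk_capacity) {
    for (std::size_t i = 0; count; ++i) {
        auto n = std::min(count, max_chunk_capacity);
        std::uninitialized_fill_n(chunks[i], n, value);  // constructs n copies of value
        size += n;
        count -= n;
    }
}
```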

@nyh
Contributor

nyh commented May 15, 2024

When the chunked_vector holds a trivially copyable type T, it's more efficient to use std::memcpy to copy or move items around than to call the copy or move constructors per item.

This change has no significant effect on perf-simple-query:

base: median 95954.90 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42312 insns/op,        0 errors)
post: median 95543.43 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42343 insns/op,        0 errors)
* [x]  ** Backport reason (please explain below if this patch should be backported or not) **
  optimization, no need to backport

If this "optimization" doesn't optimize that use case, I wonder what was your original motivation for doing it. Was it something that you thought of while reviewing the code? Do you have a more specific use case in mind which did get optimized by this change? Did you try any sort of unit-test micro-benchmark that got improved by your change?

Contributor

@nyh nyh left a comment

I looked at the algorithms carefully, and they look correct, and I'm beginning to understand the motivation. In Seastar's chunked_fifo.hh I also had a sophisticated destructor, but then realized that it isn't common to destroy a full queue (one normally consumes the queue rather than deleting it while it's very full), so I ifdef'ed out all the sophistication and left a simple loop implementation.
But I agree that chunked_vector isn't just a queue - it often holds data that is read without popping - and then the entire vector is destroyed while still full. So this optimization is worthwhile.

@nyh
Contributor

nyh commented May 15, 2024

@avikivity can you please leave your final stamp of approval? You had a lot of comments and requests previously.

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Container Test
✅ - dtest with topology changes
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 4 hr 41 min
  • Builder: i-085d1118792ec5a75 (m5ad.12xlarge)

@bhalevy
Member Author

bhalevy commented May 15, 2024

When the chunked_vector holds a trivially copyable type T, it's more efficient to use std::memcpy to copy or move items around than to call the copy or move constructors per item.
This change has no significant effect on perf-simple-query:

base: median 95954.90 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42312 insns/op,        0 errors)
post: median 95543.43 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42343 insns/op,        0 errors)
* [x]  ** Backport reason (please explain below if this patch should be backported or not) **
  optimization, no need to backport

If this "optimization" doesn't optimize that use case, I wonder what was your original motivation for doing it. Was it something that you thought of while reviewing the code?

Yes, we should prefer bulk, contiguous operations that can work sequentially on each chunk over per-item operations that need to calculate the address of each item. Plus, the bulk operations are optimized for trivial types, which we don't currently happen to use on the hot data path.

Do you have a more specific use case in mind which did get optimized by this change? Did you try any sort of unit-test micro-benchmark that got improved by your change?

#18545 will increase the usage of chunked_vector (with a non-trivial type) on the hot path, so I expect to see some benefits there (but not in perf-simple-query, since it probably won't generate long enough partition range vectors; maybe we need a separate benchmark for that).

Labels: backport/none (Backport is not required), promoted-to-master

6 participants