Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CRC64 perf improvements from Redis patches #350

Merged
merged 10 commits into from May 1, 2024

Conversation

josiahcarlson
Copy link
Contributor

@josiahcarlson josiahcarlson commented Apr 22, 2024

Improve the performance of crc64 for large batches by processing large number of bytes in parallel and combining the results.

Performance

  • 53-73% faster on Xeon 2670 v0 @ 2.6ghz
  • 2-2.5x faster on Core i3 8130U @ 2.2 ghz
  • 1.6-2.46 bytes/cycle on i3 8130U
  • likely >2x faster than crcspeed on newer CPUs with more resources than a 2012-era Xeon 2670
  • crc64 combine function runs in <50 nanoseconds typical with vector + cache optimizations (~8 microseconds without vector optimizations, ~80 *microseconds without cache, the combination is extra effective)
  • still single-threaded
  • valkey-server test crc64 --help (requires make distclean && make SERVER_TEST=yes)

* 53-73% faster on Xeon 2670 v0 @ 2.6ghz
* 2-2.5x faster on Core i3 8130U @ 2.2 ghz
* 1.6-2.46 bytes/cycle on i3 8130U
* likely >2x faster than crcspeed on newer CPUs with more resources than a 2012-era Xeon 2670
* crc64 combine function runs in <50 nanoseconds typical with vector + cache optimizations
  (~8 *microseconds* without vector optimizations, ~80 *microseconds without cache,
  the combination is extra effective)
* still single-threaded
* valkey-server test crc64 --help (requires `make distclean && make SERVER_TEST=yes`)

Signed-off-by: Josiah Carlson <josiah.carlson@gmail.com>
@zuiderkwast
Copy link
Contributor

What do we use crc64 for? RDB files?

I want to understand the practical benefit.

@madolson
Copy link
Member

What do we use crc64 for? RDB files?

RDB and slot migration. (Since the dump + restore includes the CRC checksumming). I mentioned before improving crc16 would be a bigger win for cluster mode.

@zuiderkwast
Copy link
Contributor

OK. It would be impossible to change to SHA1 or something else for RDB, DUMP/RESTORE, right? Because of backwards compatibility.

I think we should merge this then. It's more complex code, but as long as it's well tested and isolated in its own files, that's no problem to me.

Copy link
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Superficial review. I didn't proof read the bit fiddling, but I'm not intending to.

When producing or loading a large RDB dump, how much of the CPU time is CRC64 computation and how much is other stuff?

If the CRC computation is haft of the CPU time, it's good, but if the CRC computation is only 1%, then this optimization is useless IMO.

src/crc64.c Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
src/Makefile Outdated Show resolved Hide resolved
src/crccombine.c Show resolved Hide resolved
@zuiderkwast
Copy link
Contributor

I found your blog post. 😃 https://www.dr-josiah.com/2023/04/crc64-in-redis-is-getting-even-faster.html

@josiahcarlson
Copy link
Contributor Author

When producing or loading a large RDB dump, how much of the CPU time is CRC64 computation and how much is other stuff?

Let me rephrase your question; why is doing CRC64 important for Redis?

Before loading a snapshot from disk, or received over the wire from a master server, the loading server will take a pass over the whole RDB file, and call out roughly once every 2 megabytes by default (see loading_process_events_interval_bytes). These changes save ~.5 seconds per gigabyte of snapshot to be loaded.

When creating RDB snapshots, any data segment written gets a call to CRC64 for up to 2 megabytes by default (also uses loading_process_events_interval_bytes), but typically closer to whatever the data storage size. So maybe tens of bytes at present.

That said, with additional small changes, you can update on write the same 2 meg segment all the time (instead of at maximum), and the effects of this (along with eliminating fork overhead) can be seen in 20 seconds of reduced startup / snapshot time for a 10 gig resident dataset from the last slide here: https://www.slideshare.net/secret/sLTnWqcPV7G0HK , of which ~8-10 seconds can be attributed to CRC64 performance improvements.

@zuiderkwast
Copy link
Contributor

Sounds good. I'm convinced. :)

FYI: You need Signed-off-by also for commits done using suggestions in the web gui. You can add that using git rebase --signoff HEAD~1 and then force-push. (Replace 1 with N to add signoff to the last N commits.)

Copy link

codecov bot commented Apr 24, 2024

Codecov Report

Attention: Patch coverage is 96.96970% with 4 lines in your changes are missing coverage. Please review.

Project coverage is 68.48%. Comparing base (a989ee5) to head (f5c0509).
Report is 26 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #350      +/-   ##
============================================
+ Coverage     68.37%   68.48%   +0.10%     
============================================
  Files           108      109       +1     
  Lines         61555    61669     +114     
============================================
+ Hits          42091    42235     +144     
+ Misses        19464    19434      -30     
Files Coverage Δ
src/crc64.c 100.00% <100.00%> (ø)
src/crccombine.c 100.00% <100.00%> (ø)
src/crcspeed.c 37.97% <90.47%> (+9.18%) ⬆️

... and 17 files with indirect coverage changes

josiahcarlson and others added 2 commits April 24, 2024 15:42
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Josiah Carlson <josiah.carlson@gmail.com>
Signed-off-by: Josiah Carlson <josiah.carlson@gmail.com>
Signed-off-by: Josiah Carlson <josiah.carlson@gmail.com>
Signed-off-by: Josiah Carlson <josiah.carlson@gmail.com>
@madolson
Copy link
Member

Assuming tests pass, I'm happy with it.

Copy link
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Signed-off-by: Josiah Carlson <josiah.carlson@gmail.com>
Signed-off-by: Josiah Carlson <josiah.carlson@gmail.com>
src/crccombine.c Outdated Show resolved Hide resolved
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
src/crcspeed.c Outdated Show resolved Hide resolved
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
@madolson
Copy link
Member

Btw, I just want to test it on another arm machine. I'll do that tomorrow and merge it if the performance makes sense.

@zuiderkwast zuiderkwast added the to-be-merged Almost ready to merge label Apr 30, 2024
src/crccombine.c Outdated Show resolved Hide resolved
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Copy link
Member

@madolson madolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, builds and some quick tests show it's faster than crcspeed.

@madolson madolson added release-notes This issue should get a line item in the release notes and removed to-be-merged Almost ready to merge labels May 1, 2024
@madolson madolson merged commit f4e10ee into valkey-io:unstable May 1, 2024
17 checks passed
PingXie pushed a commit to PingXie/valkey that referenced this pull request May 2, 2024
Improve the performance of crc64 for large batches by processing large
number of bytes in parallel and combining the results.

## Performance 
* 53-73% faster on Xeon 2670 v0 @ 2.6ghz
* 2-2.5x faster on Core i3 8130U @ 2.2 ghz
* 1.6-2.46 bytes/cycle on i3 8130U
* likely >2x faster than crcspeed on newer CPUs with more resources than
a 2012-era Xeon 2670
* crc64 combine function runs in <50 nanoseconds typical with vector +
cache optimizations (~8 *microseconds* without vector optimizations, ~80
*microseconds without cache, the combination is extra effective)
* still single-threaded
* valkey-server test crc64 --help (requires `make distclean && make
SERVER_TEST=yes`)

---------

Signed-off-by: Josiah Carlson <josiah.carlson@gmail.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes This issue should get a line item in the release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants