
parallel operations for some binary caching providers #1392

Open: wants to merge 4 commits into main
Conversation


@Crzyrndm Crzyrndm commented Apr 25, 2024

microsoft/vcpkg#38404

This adds parallel operations, using parallel_for_each / parallel_transform, to ObjectStorageProvider and ObjectStoragePushProvider. There is currently no limit on parallelism beyond that imposed by the thread pool.

This affects the following caching providers:

  • GcsStorageTool
  • AwsStorageTool
  • CosStorageTool

Questions

  • This adds parallelisation at the provider level. Is this the right place to add it?
  • It is likely that a parallelisation limit will be desired, probably with a default of 1 (no parallelisation) and a config option to raise that limit. I'm not familiar enough with this codebase to implement this cleanly without some direction.
    • I would normally use a semaphore for rate limiting. Are there any examples of similar functionality I can pull from?
    • How would this be added as a configuration? An environment variable? Any examples to base it on?
  • Testing?
    • I couldn't find any tests for the affected caching providers, and this isn't something I would normally test anyway (it's an implementation detail).

@BillyONeal (Member)

In general, you're parallelizing the download here, but most of the time we'd expect this work not to be CPU-limited, and thus not to improve as a result of parallelization. Do you have a real-world use case where this ended up meaningfully faster?

@Crzyrndm (Author)

Crzyrndm commented May 15, 2024

For clarity: my experience here is with an AWS S3 cache specifically.
I fully agree that this is an IO-limited task and that threading therefore may not be the best model. This was more a quick demo to show there were significant potential improvements; if you have an example of a more appropriate set-up I can use, I'm open to ideas. I also haven't done anything to verify the safety/correctness of parallelising these operations, so there may be issues there.

Taking a project I have with 10 small dependencies (download size is minimal):
> $Env:VCPKG_BINARY_SOURCES="clear;x-aws,s3://<test bucket name>/,read"

  • read-only for consistency between tests

The lib uses vcpkg in manifest mode via CMake toolchain integration. Build machine is Windows 11 with i7-12700 (8P + 4E)

vcpkg/vcpkg --version
vcpkg package management program version 2024-03-14-7d353e869753e5609a1f1a057df3db8fd356e49d
# all measurements are from the cmake configure output
cmake --preset=<preset name>

AWS S3 cache miss

Using the default vcpkg release binary - querying for presence takes 9.5s


Using the custom vcpkg binary from this branch - querying for presence takes 1.7s


AWS S3 cache hit

Using the default release binary - query and download takes 23s


Using the custom vcpkg binary - query and download takes 4.3s


Summary

So, approximately a 5x improvement for both query and download, and that should scale roughly in proportion to the total number of dependencies (assuming this was done correctly and all deps are roughly equivalent).
Writing back doesn't scale the same way because it's done after each package is built, so please ignore that.

Note that typically I only use the cache for CI (GHA), and the time per package can be 1-3 seconds. If it were 5s to check the AWS cache I would probably consider using it locally as well; even better if microsoft/vcpkg#38684 were a thing.

@BillyONeal (Member)

I agree a real world 5x improvement is worth opening a can of worms for; OK

this is pushing to the different caches(?) in parallel, not by package like the others
@Crzyrndm (Author)

Thoughts on where to go with this?
If you've got an example of parallelising shell commands that would be more appropriate than the quick-and-dirty parallel_* I used here, I could take a whack at that. The data flow lends itself well to parallelisation, and to my knowledge the CLI tools for AWS/GCP are set up for multi-process operation, so correctness probably isn't as much of an issue as it could be (famous last words...)

@Crzyrndm Crzyrndm marked this pull request as ready for review May 27, 2024 22:52
@Crzyrndm
Copy link
Author

Crzyrndm commented May 27, 2024

cmd_execute_and_capture_output_parallel is doing pretty much the same thing I'm doing here. Using it directly would require a significant reshuffle of the IObjectStorageTool API. The minimal change is to use the parallel functions to replace the loop, as has been done.

  • Added a comment to the stat and download_file declarations noting the need for the impl to be thread-safe
  • As far as I can tell, all operations within the parallel block either do not modify state or operate on data belonging to the specific package (not shared). As long as the impls support parallelisation, they should be correct
