
parallel operations for some binary caching providers #1392

Open: wants to merge 4 commits into main
Conversation


@Crzyrndm Crzyrndm commented Apr 25, 2024

microsoft/vcpkg#38404

This adds parallel operations, using parallel_for_each / parallel_transform, to ObjectStorageProvider and ObjectStoragePushProvider. There is currently no limit on parallelism beyond that imposed by the thread pool.

This affects the following caching providers:

  • GcsStorageTool
  • AwsStorageTool
  • CosStorageTool

Questions

  • This adds parallelisation at the provider level. Is this the right place to add it?
  • It is likely that a parallelisation limit will be desired, probably with a default of 1 (no parallelisation) and a config option to raise that limit. I'm not familiar enough with this codebase to implement this cleanly without some direction.
    • I would normally use a semaphore for rate limiting. Are there any examples of similar functionality I can pull from?
    • How would this be added as a configuration? An environment variable? Any examples to base it on?
  • Testing?
    • I couldn't find any tests for the affected caching providers, and this isn't something I would normally test anyway (it's an implementation detail).

@BillyONeal (Member)

In general, you're parallelizing the download here, but most of the time we'd expect this work not to be CPU-limited, and thus not to improve as a result of parallelization. Do you have a real-world use case where this ended up meaningfully faster?

@Crzyrndm (Author)

Crzyrndm commented May 15, 2024

For clarity: my experience here is with an AWS S3 cache specifically.
I fully agree that this is an IO-limited task and that threading therefore may not be the best model. This was more a quick demo to show there were significant potential improvements; if you have an example of a more appropriate set-up I can use, I'm open to ideas. I also haven't done anything to verify the safety/correctness of parallelising these operations, so there may be issues there.

Taking a project I have with 10 small dependencies (download size is minimal):
> $Env:VCPKG_BINARY_SOURCES="clear;x-aws,s3://<test bucket name>/,read"

  • read-only for consistency between tests

The lib uses vcpkg in manifest mode via CMake toolchain integration. Build machine is Windows 11 with i7-12700 (8P + 4E)

vcpkg/vcpkg --version
vcpkg package management program version 2024-03-14-7d353e869753e5609a1f1a057df3db8fd356e49d
# all measurements are from the cmake configure output
cmake --preset=<preset name>

AWS S3 cache miss

Using the default vcpkg release binary - querying for presence takes 9.5s


Using the custom vcpkg binary from this branch - querying for presence takes 1.7s


AWS S3 cache hit

Using the default release binary - query and download takes 23s


Using the custom vcpkg binary - query and download takes 4.3s


Summary

So, approximately a 5x improvement for both query and download, and that should scale roughly in proportion to the total number of dependencies (assuming this was done correctly and all deps are roughly equivalent).
Writing back doesn't scale the same way because it's done after each package is built, so please ignore that.

Note that typically I only use the cache for CI (GHA), and the time per package can be 1-3 seconds. If it were 5s to check the AWS cache I would probably consider using it locally as well; even better if microsoft/vcpkg#38684 were a thing.

@BillyONeal (Member)

I agree a real world 5x improvement is worth opening a can of worms for; OK

this is pushing to the different caches(?) in parallel, not by package like the others
@Crzyrndm (Author)

Thoughts on where to go with this?
If you've got an example of parallelising shell commands that would be more appropriate than the quick-and-dirty parallel_* I used here, I could take a whack at that. The data flow lends itself well to parallelisation, and to my knowledge the CLI tools for AWS/GCP are set up for multi-process operation, so correctness probably isn't as much of an issue as it could be (famous last words...)

@Crzyrndm Crzyrndm marked this pull request as ready for review May 27, 2024 22:52
@Crzyrndm
Copy link
Author

Crzyrndm commented May 27, 2024

cmd_execute_and_capture_output_parallel is doing pretty much the same thing I'm doing here. Using it directly would require a significant reshuffle of the IObjectStorageTool API. The minimal change is to use the parallel functions to replace the loop, as has been done.

  • Added a comment to the stat and download_file declarations noting the need for the impl to be thread-safe
  • As far as I can tell, all operations within the parallel block either do not modify state or operate on data belonging to the specific package (not shared). As long as the impls support parallelisation, they should be correct
