New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
c10d: add Collectives abstraction #125978
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125978
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit e14d5f7 with merge base 697ed6f (): This comment was automatically generated by Dr. CI and updates every 15 minutes. |
class TORCH_API Collectives : public torch::CustomClassHolder { | ||
public: | ||
virtual void barrier( | ||
const std::string& prefix, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is 'prefix' specific to store based implementation of the APIs? If so, I imagine the APIs could be implemented with non-store semantics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It refers to the store based APIs but other implementations have a similar concept of an operation ID. I.e. the label field in https://fburl.com/code/dq8shbwi.
It may be best to change this name from prefix to "operation_uid"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if prefix
better maps to e.g. sub_world_id
in SC? Is the semantics that operations on each "prefix"/"subworld" must be sequential, and using different prefixes allows for a new namespace/fork to perform concurrent operations in e.g. separate threads?
But yeah, it does feel like prefix isn't a great name for the base class...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is still a better fit for label -- all operations on store must be sequential regardless of prefix.
If you want a fully separate namespace you can use a different PrefixStore and create a new StoreCollectives instance -- similar to how you would use a different sub_world_id and create a new SCCollectives instance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed to generic key
Does the term |
In addition, it seems torch/csrc/distributed/c10d has many files now. Does it make sense for store to have its own directory? I am just not sure whether the new directory should under c10d or at the same level of c10d. If store is used by both c10d and torchrun/elastic, it seems the latter makes sense. However, a lot of store files also have c10d namespace, so that makes me uncertain. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Performance nits / improvements.
updates:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
|
||
namespace c10d { | ||
|
||
class TORCH_API StoreCollectives : public ControlCollectives { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StoreControlCollectives?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about that but figured Store would be clear enough -- it wouldn't make sense to use the Store API for dataplane operations
class TORCH_API ControlCollectives : public torch::CustomClassHolder { | ||
public: | ||
virtual void barrier( | ||
const std::string& key, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
key
is better, I wonder if it is actually a 'token'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think they're equivalent -- key maps better to the underlying concept so will just leave it as is
} | ||
|
||
void StoreCollectives::broadcast_send( | ||
const std::string& key, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For these simple APIs such as broadcast, is 'key' really necessary for users? can we make it optional for users? e.g., they just need to provide the data to send and recv
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shuqiangzhang we need a unique identifier for every round due to some implementation details -- we could auto increment it internally but could cause problems. SC and store backends both require a unique key
|
||
def f(rank: int) -> None: | ||
collectives = dist.StoreCollectives(store, rank, world_size) | ||
collectives.barrier("foo", timedelta(seconds=10), True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I mean this 'foo' does not really provide much value for users? could be optional or removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For store, this needs to be unique, if you call it multiple times with "foo" it won't actually do any barrier operation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I was thinking it would be much nicer and cleaner if our general API could remove the burden from users of maintaining an unique key for each collective due to implementation limitations. Can we maintain a seqID or something within the implementation? I think this is also similar to our data plane side of collective APIs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could potentially -- it gets a bit risky when you consider the case that not all workers may be joining at the same time. In elastic it's elastic so a simple seqID may not be sufficient and you need to key it off of the rendezvous round instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, this makes sense in elastic, let's move forward
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, great start of the new control plane! Please do remember to unify the variable naming scheme
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 1 mandatory check(s) failed. The first few are: Dig deeper by viewing the failures on hud |
Merge failedReason: 1 mandatory check(s) failed. The first few are: Dig deeper by viewing the failures on hud |
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 1 mandatory check(s) failed. The first few are: Dig deeper by viewing the failures on hud |
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
This adds a new `Collectives` API for doing distributed collectives operations. This is intended to replace the [current Elastic store abstraction](https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/utils/store.py) with more performant and debugable primitives. Design doc: https://docs.google.com/document/d/147KcKJXEHvk1Q6tISLbJVvLejHg_1kIhBQeu-8RQxhY/edit The standard implementation is using `StoreCollectives` but other more performant backends will be added in a follow up PR. Test plan: ``` python test/distributed/test_collectives.py -v ``` This tests both functionality using multiple threads as well as timeout behavior. Pull Request resolved: pytorch#125978 Approved by: https://github.com/shuqiangzhang
@pytorchbot revert -m "Break internal build" -c ghfirst |
@pytorchbot successfully started a revert job. Check the current status here. |
This reverts commit 4b2ae2a. Reverted #125978 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](#125978 (comment)))
@d4l3k your PR has been successfully reverted. |
Summary: This reverts commit d9c3485. Reapplies #125978. cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang Reviewed By: c-p-i-o Differential Revision: D57572818 Pulled By: d4l3k
Summary: This reverts commit d9c3485. Reapplies #125978. cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang Reviewed By: c-p-i-o Differential Revision: D57572818 Pulled By: d4l3k
This reverts commit d9c3485. Reapplies #125978. Pull Request resolved: #126695 Approved by: https://github.com/c-p-i-o
This adds a new
Collectives
API for doing distributed collectives operations. This is intended to replace the current Elastic store abstraction with more performant and debugable primitives.Design doc: https://docs.google.com/document/d/147KcKJXEHvk1Q6tISLbJVvLejHg_1kIhBQeu-8RQxhY/edit
The standard implementation is using
StoreCollectives
but other more performant backends will be added in a follow up PR.Test plan:
This tests both functionality using multiple threads as well as timeout behavior.
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang