Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

c10d: add Collectives abstraction #125978

Closed
wants to merge 1 commit into from
Closed

c10d: add Collectives abstraction #125978

wants to merge 1 commit into from

Conversation

d4l3k
Copy link
Collaborator

@d4l3k d4l3k commented May 10, 2024

This adds a new Collectives API for doing distributed collectives operations. This is intended to replace the current Elastic store abstraction with more performant and debugable primitives.

Design doc: https://docs.google.com/document/d/147KcKJXEHvk1Q6tISLbJVvLejHg_1kIhBQeu-8RQxhY/edit

The standard implementation is using StoreCollectives but other more performant backends will be added in a follow up PR.

Test plan:

python test/distributed/test_collectives.py -v

This tests both functionality using multiple threads as well as timeout behavior.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang

@d4l3k d4l3k requested review from zdevito, suo, wconstab and kurman May 10, 2024 23:12
Copy link

pytorch-bot bot commented May 10, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125978

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e14d5f7 with merge base 697ed6f (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels May 10, 2024
@d4l3k d4l3k changed the title c10d: added Collectives abstraction c10d: add Collectives abstraction May 10, 2024
class TORCH_API Collectives : public torch::CustomClassHolder {
public:
virtual void barrier(
const std::string& prefix,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 'prefix' specific to store based implementation of the APIs? If so, I imagine the APIs could be implemented with non-store semantics

Copy link
Collaborator Author

@d4l3k d4l3k May 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It refers to the store based APIs but other implementations have a similar concept of an operation ID. I.e. the label field in https://fburl.com/code/dq8shbwi.

It may be best to change this name from prefix to "operation_uid"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if prefix better maps to e.g. sub_world_id in SC? Is the semantics that operations on each "prefix"/"subworld" must be sequential, and using different prefixes allows for a new namespace/fork to perform concurrent operations in e.g. separate threads?

But yeah, it does feel like prefix isn't a great name for the base class...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is still a better fit for label -- all operations on store must be sequential regardless of prefix.

If you want a fully separate namespace you can use a different PrefixStore and create a new StoreCollectives instance -- similar to how you would use a different sub_world_id and create a new SCCollectives instance

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to generic key

@kwen2501
Copy link
Contributor

Does the term Collectives here refer to "collectives for control plane (torchrun/elastic)"?
If so, can file name and class name like Collectives.hpp be renamed to more clearly differentiate from the data plane collectives (i.e. c10d)? Thanks!

@kwen2501
Copy link
Contributor

In addition, it seems torch/csrc/distributed/c10d has many files now. Does it make sense for store to have its own directory? I am just not sure whether the new directory should under c10d or at the same level of c10d. If store is used by both c10d and torchrun/elastic, it seems the latter makes sense. However, a lot of store files also have c10d namespace, so that makes me uncertain.

Copy link
Collaborator

@Skylion007 Skylion007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performance nits / improvements.

torch/csrc/distributed/c10d/StoreCollectives.cpp Outdated Show resolved Hide resolved
torch/csrc/distributed/c10d/StoreCollectives.cpp Outdated Show resolved Hide resolved
torch/csrc/distributed/c10d/StoreCollectives.cpp Outdated Show resolved Hide resolved
torch/csrc/distributed/c10d/StoreCollectives.cpp Outdated Show resolved Hide resolved
torch/csrc/distributed/c10d/init.cpp Outdated Show resolved Hide resolved
torch/csrc/distributed/c10d/init.cpp Outdated Show resolved Hide resolved
torch/csrc/distributed/c10d/init.cpp Outdated Show resolved Hide resolved
@d4l3k
Copy link
Collaborator Author

d4l3k commented May 13, 2024

updates:

  • moved to c10d/control_collectives dir
  • renamed Collectives to ControlCollectives
  • renamed prefix to key to be backend agnostic
  • applies fmt/vector reserve optimizations

Copy link
Contributor

@kurman kurman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice


namespace c10d {

class TORCH_API StoreCollectives : public ControlCollectives {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StoreControlCollectives?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about that but figured Store would be clear enough -- it wouldn't make sense to use the Store API for dataplane operations

class TORCH_API ControlCollectives : public torch::CustomClassHolder {
public:
virtual void barrier(
const std::string& key,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

key is better, I wonder if it is actually a 'token'?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they're equivalent -- key maps better to the underlying concept so will just leave it as is

}

void StoreCollectives::broadcast_send(
const std::string& key,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these simple APIs such as broadcast, is 'key' really necessary for users? can we make it optional for users? e.g., they just need to provide the data to send and recv

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shuqiangzhang we need a unique identifier for every round due to some implementation details -- we could auto increment it internally but could cause problems. SC and store backends both require a unique key


def f(rank: int) -> None:
collectives = dist.StoreCollectives(store, rank, world_size)
collectives.barrier("foo", timedelta(seconds=10), True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I mean this 'foo' does not really provide much value for users? could be optional or removed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For store, this needs to be unique, if you call it multiple times with "foo" it won't actually do any barrier operation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I was thinking it would be much nicer and cleaner if our general API could remove the burden from users of maintaining an unique key for each collective due to implementation limitations. Can we maintain a seqID or something within the implementation? I think this is also similar to our data plane side of collective APIs

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could potentially -- it gets a bit risky when you consider the case that not all workers may be joining at the same time. In elastic it's elastic so a simple seqID may not be sufficient and you need to key it off of the rendezvous round instead

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, this makes sense in elastic, let's move forward

Copy link
Contributor

@shuqiangzhang shuqiangzhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, great start of the new control plane! Please do remember to unify the variable naming scheme

@d4l3k
Copy link
Collaborator Author

d4l3k commented May 15, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label May 15, 2024
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@d4l3k
Copy link
Collaborator Author

d4l3k commented May 17, 2024

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@d4l3k
Copy link
Collaborator Author

d4l3k commented May 17, 2024

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@d4l3k d4l3k deleted the d4l3k/collectives branch May 17, 2024 16:19
ZelboK pushed a commit to ZelboK/pytorch that referenced this pull request May 19, 2024
This adds a new `Collectives` API for doing distributed collectives operations. This is intended to replace the [current Elastic store abstraction](https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/utils/store.py) with more performant and debugable primitives.

Design doc: https://docs.google.com/document/d/147KcKJXEHvk1Q6tISLbJVvLejHg_1kIhBQeu-8RQxhY/edit

The standard implementation is using `StoreCollectives` but other more performant backends will be added in a follow up PR.

Test plan:

```
python test/distributed/test_collectives.py -v
```

This tests both functionality using multiple threads as well as timeout behavior.

Pull Request resolved: pytorch#125978
Approved by: https://github.com/shuqiangzhang
@DanilBaibak
Copy link
Contributor

@pytorchbot revert -m "Break internal build" -c ghfirst

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request May 20, 2024
This reverts commit 4b2ae2a.

Reverted #125978 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](#125978 (comment)))
@pytorchmergebot
Copy link
Collaborator

@d4l3k your PR has been successfully reverted.

@DanilBaibak
Copy link
Contributor

Hi @d4l3k! Sorry, I need to revert your PR because it breaks the internal build. Here you can find more information about the issue - D57518638.

d4l3k added a commit that referenced this pull request May 20, 2024
facebook-github-bot pushed a commit that referenced this pull request May 20, 2024
Summary:
This reverts commit d9c3485.

Reapplies #125978.


cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang


Reviewed By: c-p-i-o

Differential Revision: D57572818

Pulled By: d4l3k
facebook-github-bot pushed a commit that referenced this pull request May 20, 2024
Summary:
This reverts commit d9c3485.

Reapplies #125978.


cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang


Reviewed By: c-p-i-o

Differential Revision: D57572818

Pulled By: d4l3k
pytorchmergebot pushed a commit that referenced this pull request May 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category Reverted
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants