Dmoe integration #1210

Open · wants to merge 12 commits into main
Conversation

DayOfThePenguin (Contributor)

Supersedes #1197

This PR adds dropless MoE support using the Grouped GEMM implementation in megablocks.

Features

Unlike the legacy DeepSpeed MoE implementation, which uses the data parallel groups for expert parallelism, this implementation uses the model parallel group to parallelize the experts (a minimal sketch of this sharding follows the list). This avoids the following problems:

  • Distributing the experts across data parallel groups incurs inter-node communication for a forward pass through a single layer.
  • MoE + pipeline parallelism is very hard to reason about when MoE weights are distributed across data parallel groups, and DeepSpeed doesn't natively support it.
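To make the idea concrete, here is a minimal sketch (not the PR's code; all names such as `build_expert_parallel_group` and `model_parallel_ranks` are illustrative) of reusing the model parallel ranks to shard experts, so expert traffic stays inside the typically intra-node model parallel group:

```python
# Illustrative sketch only: shard experts over the model (tensor) parallel
# group instead of the data parallel group.
import torch.distributed as dist

def build_expert_parallel_group(model_parallel_ranks):
    """Reuse the existing model parallel ranks as the expert parallel group,
    so expert communication stays inside the (typically intra-node) model
    parallel group rather than crossing data parallel (inter-node) links."""
    return dist.new_group(ranks=model_parallel_ranks)

def local_expert_indices(num_experts, group):
    """Each model-parallel rank owns a contiguous slice of the experts."""
    world_size = dist.get_world_size(group=group)
    rank = dist.get_rank(group=group)
    assert num_experts % world_size == 0, "experts must divide the group size evenly"
    per_rank = num_experts // world_size
    return list(range(rank * per_rank, (rank + 1) * per_rank))
```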

Clarified arguments by separating MoE args into their own class.
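As a rough illustration of what grouping the MoE arguments into a dedicated class looks like (field names below are hypothetical, not the PR's actual argument names):

```python
# Hypothetical sketch of a dedicated MoE argument class.
from dataclasses import dataclass

@dataclass
class MoEArgs:
    moe_num_experts: int = 1           # 1 disables MoE (dense model)
    moe_top_k: int = 1                 # experts routed per token (k >= 1)
    moe_loss_coeff: float = 0.1        # weight on the auxiliary load-balancing loss
    moe_router_type: str = "sinkhorn"  # "sinkhorn" for training, top-k for eval/inference
```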

Sinkhorn routing is used by default during training and supports k >= 1; top-k routing is used for evaluation/inference (see the sketch below).
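A minimal sketch of sinkhorn-normalized routing, assumed rather than copied from the PR: the Sinkhorn iteration alternately normalizes the rows (tokens) and columns (experts) of exp(logits) so that expert assignment is approximately balanced, while gating probabilities come from a plain softmax.

```python
import torch

def sinkhorn(cost: torch.Tensor, n_iters: int = 3, eps: float = 1e-8) -> torch.Tensor:
    """cost: [num_tokens, num_experts] router logits."""
    d = torch.exp(cost)
    for _ in range(n_iters):
        d = d / (d.sum(dim=0, keepdim=True) + eps)  # balance columns (experts)
        d = d / (d.sum(dim=1, keepdim=True) + eps)  # normalize rows (tokens)
    return d

def route(logits: torch.Tensor, k: int, training: bool):
    if training:
        # Pick experts from the sinkhorn-balanced matrix, but take the gating
        # probabilities from a plain softmax so gradients stay well behaved.
        with torch.no_grad():
            _, indices = torch.topk(sinkhorn(logits.float()), k, dim=1)
        probs = torch.softmax(logits, dim=1).gather(1, indices)
    else:
        # Plain top-k routing for evaluation/inference.
        probs, indices = torch.topk(torch.softmax(logits, dim=1), k, dim=1)
    return probs, indices
```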

Testing

Tested pipeline parallel sizes [3, 2, 1] and model parallel sizes [1, 2, 4, 8] on Ampere GPUs.

Notes

Added megablocks and grouped_gemm to the dependencies. It might be desirable to vendor some of the kernels directly, as NVIDIA Megatron-Core does.
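For reference, a grouped GEMM multiplies each expert's block of (sorted) tokens by that expert's weight matrix in one fused kernel; a naive PyTorch equivalent, purely illustrative and not the megablocks/grouped_gemm kernel, looks like this:

```python
import torch

def grouped_gemm_reference(tokens: torch.Tensor,
                           weights: torch.Tensor,
                           tokens_per_expert: torch.Tensor) -> torch.Tensor:
    """Naive loop equivalent of a grouped GEMM.

    tokens:            [sum(tokens_per_expert), hidden]  tokens sorted by expert
    weights:           [num_experts, hidden, ffn]        one weight matrix per expert
    tokens_per_expert: [num_experts]                     group sizes
    """
    outputs, start = [], 0
    for e, n in enumerate(tokens_per_expert.tolist()):
        outputs.append(tokens[start:start + n] @ weights[e])
        start += n
    return torch.cat(outputs, dim=0)
```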
