Dmoe integration #1210

Open · wants to merge 12 commits into main
Conversation

DayOfThePenguin (Contributor)

Supersedes #1197

This PR adds dropless MoE support using the Grouped GEMM implementation in megablocks.

Features

Unlike the legacy DeepSpeed MoE implementation, which uses the data parallel groups for expert parallelism, this implementation uses the model parallel group to parallelize the experts (a minimal sketch of this sharding follows the list). This avoids the following problems:

  • Distributing the experts across data parallel groups incurs inter-node communication for a forward pass through a single layer.
  • MoE + pipeline parallelism is very hard to reason about when MoE weights are distributed across data parallel groups, and DeepSpeed doesn't natively support it.
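To make the idea concrete, here is a minimal sketch (not the PR's code; all names such as `build_expert_parallel_group` and `model_parallel_ranks` are illustrative) of reusing the model parallel ranks to shard experts, so expert traffic stays inside the typically intra-node model parallel group:

```python
# Illustrative sketch only: shard experts over the model (tensor) parallel
# group instead of the data parallel group.
import torch.distributed as dist

def build_expert_parallel_group(model_parallel_ranks):
    """Reuse the existing model parallel ranks as the expert parallel group,
    so expert communication stays inside the (typically intra-node) model
    parallel group rather than crossing data parallel (inter-node) links."""
    return dist.new_group(ranks=model_parallel_ranks)

def local_expert_indices(num_experts, group):
    """Each model-parallel rank owns a contiguous slice of the experts."""
    world_size = dist.get_world_size(group=group)
    rank = dist.get_rank(group=group)
    assert num_experts % world_size == 0, "experts must divide the group size evenly"
    per_rank = num_experts // world_size
    return list(range(rank * per_rank, (rank + 1) * per_rank))
```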

Clarified arguments by separating MoE args into their own class.
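As a rough illustration of what grouping the MoE arguments into a dedicated class looks like (field names below are hypothetical, not the PR's actual argument names):

```python
# Hypothetical sketch of a dedicated MoE argument class.
from dataclasses import dataclass

@dataclass
class MoEArgs:
    moe_num_experts: int = 1           # 1 disables MoE (dense model)
    moe_top_k: int = 1                 # experts routed per token (k >= 1)
    moe_loss_coeff: float = 0.1        # weight on the auxiliary load-balancing loss
    moe_router_type: str = "sinkhorn"  # "sinkhorn" for training, top-k for eval/inference
```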

Sinkhorn routing is used by default during training and supports k >= 1; top-k routing is used for evaluation/inference (see the sketch below).
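A minimal sketch of sinkhorn-normalized routing, assumed rather than copied from the PR: the Sinkhorn iteration alternately normalizes the rows (tokens) and columns (experts) of exp(logits) so that expert assignment is approximately balanced, while gating probabilities come from a plain softmax.

```python
import torch

def sinkhorn(cost: torch.Tensor, n_iters: int = 3, eps: float = 1e-8) -> torch.Tensor:
    """cost: [num_tokens, num_experts] router logits."""
    d = torch.exp(cost)
    for _ in range(n_iters):
        d = d / (d.sum(dim=0, keepdim=True) + eps)  # balance columns (experts)
        d = d / (d.sum(dim=1, keepdim=True) + eps)  # normalize rows (tokens)
    return d

def route(logits: torch.Tensor, k: int, training: bool):
    if training:
        # Pick experts from the sinkhorn-balanced matrix, but take the gating
        # probabilities from a plain softmax so gradients stay well behaved.
        with torch.no_grad():
            _, indices = torch.topk(sinkhorn(logits.float()), k, dim=1)
        probs = torch.softmax(logits, dim=1).gather(1, indices)
    else:
        # Plain top-k routing for evaluation/inference.
        probs, indices = torch.topk(torch.softmax(logits, dim=1), k, dim=1)
    return probs, indices
```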

Testing

Tested pipeline parallel sizes [3, 2, 1] and model parallel sizes [1, 2, 4, 8] on Ampere GPUs.

Notes

Added megablocks and grouped_gemm to the dependencies. It might be desirable to vendor some of the kernels directly, as NVIDIA Megatron-Core does.
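For reference, a grouped GEMM multiplies each expert's block of (sorted) tokens by that expert's weight matrix in one fused kernel; a naive PyTorch equivalent, purely illustrative and not the megablocks/grouped_gemm kernel, looks like this:

```python
import torch

def grouped_gemm_reference(tokens: torch.Tensor,
                           weights: torch.Tensor,
                           tokens_per_expert: torch.Tensor) -> torch.Tensor:
    """Naive loop equivalent of a grouped GEMM.

    tokens:            [sum(tokens_per_expert), hidden]  tokens sorted by expert
    weights:           [num_experts, hidden, ffn]        one weight matrix per expert
    tokens_per_expert: [num_experts]                     group sizes
    """
    outputs, start = [], 0
    for e, n in enumerate(tokens_per_expert.tolist()):
        outputs.append(tokens[start:start + n] @ weights[e])
        start += n
    return torch.cat(outputs, dim=0)
```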
