
Bias weights are multi-added when using gpt_j_residual in model-parallel execution #962

Open · cbcase opened this issue May 31, 2023 · 1 comment
Labels: bug (Something isn't working), good first issue (Good for newcomers)
cbcase commented May 31, 2023

Describe the bug
In ParallelTransformerLayer.forward, when using the gpt_j_residual path, both the SelfAttention block and the MLP block run with parallel_output=True and return their activations and biases separately. The output biases of those blocks are replicated across model-parallel ranks (not partitioned), and each rank adds the bias before performing the model-parallel reduce. In pseudocode (and ignoring dropout), you have:

# The MLP output is row-parallel: each rank holds a partial sum, while the bias is fully replicated.
activations_parallel, bias_parallel = MLP(...)
# Every rank adds its own (full) copy of the bias ...
activations_parallel += bias_parallel
# ... and the model-parallel all-reduce then sums those copies together with the partial activations.
output = activations_parallel.reduce()

As a result, if you run k-way model parallel, output contains k * bias_parallel rather than a single addition of the bias.
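To see the effect concretely, here is a minimal single-process sketch (illustrative only; plain PyTorch, no distributed setup) that simulates k ranks, each holding a partial activation and a replicated copy of the bias:

# Single-process simulation of the k-way model-parallel reduce.
import torch

k = 2        # model-parallel degree
hidden = 4

# Each "rank" holds a partial activation (its shard of the row-parallel matmul)
# and an identical, replicated copy of the full bias.
partial_activations = [torch.randn(hidden) for _ in range(k)]
bias = torch.randn(hidden)

# Buggy order: every rank adds the bias, then the all-reduce sums across ranks.
buggy = sum(act + bias for act in partial_activations)

# Intended result: reduce first, add the bias once.
correct = sum(partial_activations) + bias

print(torch.allclose(buggy, correct + (k - 1) * bias))  # True: the bias is counted k times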

If you are training from scratch, this is mostly harmless (the model can learn to scale the bias down by k), but it breaks loading an existing checkpoint at any model-parallel degree greater than 1. (And it does so in a way that is hard to perceive: the accuracy penalty is not large, but it is measurable.)

Expected behavior
The bias should be added to the output exactly once, irrespective of the model-parallel degree.

Proposed solution
This one is tricky. I have hacked around it in my own code for the dropout=0 case: wait to add the bias until after the model-parallel all-reduce. But a general solution requires a more careful reorganization of the code.
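For reference, a sketch of that workaround for the dropout=0 case, assuming the Megatron-style mpu.reduce_from_model_parallel_region helper; the variable names below are illustrative, not the exact ones in ParallelTransformerLayer.forward:

# Both blocks run with parallel_output=True, so biases come back separately.
attention_output, attention_bias = self.attention(x1, attention_mask)
mlp_output, mlp_bias = self.mlp(x2)

# All-reduce the partial activations across model-parallel ranks first ...
output = mpu.reduce_from_model_parallel_region(attention_output + mlp_output)

# ... then add each (replicated) bias exactly once, plus the residual.
output = output + attention_bias + mlp_bias + residual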

Environment (please complete the following information):

  • GPUs: A100s
  • Configs: a 16B parameter model running model-parallel-size: 2
cbcase added the bug label on May 31, 2023
StellaAthena (Member) commented

This is quite interesting, thanks for flagging it.

What if we just do activations_parallel += bias_parallel / tp_size? And I would guess that we could convert old checkpoints to work with this new code by multiplying the biases by tp_size?
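A sketch of what that could look like. This is illustrative only: tp_size is assumed to come from the Megatron-style mpu.get_model_parallel_world_size(), and the parameter-name suffixes in the conversion helper are assumptions about the checkpoint layout:

# In the layer: pre-scale the replicated bias so the later all-reduce,
# which sums tp_size copies, contributes exactly one bias overall.
tp_size = mpu.get_model_parallel_world_size()
attention_output = attention_output + attention_bias / tp_size
mlp_output = mlp_output + mlp_bias / tp_size
output = mpu.reduce_from_model_parallel_region(attention_output + mlp_output) + residual

# Hypothetical one-off conversion so checkpoints trained with the old
# (multi-added) behavior keep the same effective bias under the new scaling.
def convert_old_checkpoint(state_dict, tp_size):
    for name, tensor in list(state_dict.items()):
        if name.endswith(("attention.dense.bias", "mlp.dense_4h_to_h.bias")):
            state_dict[name] = tensor * tp_size
    return state_dict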

StellaAthena added the good first issue label on Jun 3, 2023