
[QUESTION] bf16 Parameters and fp32 Gradients #800

Open

pluiez commented Apr 30, 2024

In the README for the distributed optimizer, it is mentioned that bf16 training uses a combination of bf16 model parameters and fp32 model grads, and that the distributed optimizer's fp32 main gradients are the same as the model's fp32 gradients. However, as far as I know, in PyTorch the gradients produced by the forward and backward passes typically match the data type of the parameters. So bf16 model params should always yield bf16 model grads, and this is apparently what happens in fp16 training, where an extra copy of fp32 main grads in the optimizer is necessary.
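For reference, the default behaviour I mean can be checked with the short sketch below. The second half is only a hypothetical illustration of how gradients could be accumulated in a dtype different from the parameters via a tensor hook; the `main_grad` attribute name is assumed here and not taken from the Megatron-LM source:

```python
import torch

# Default behaviour: the gradient dtype follows the parameter dtype.
layer = torch.nn.Linear(8, 8).to(torch.bfloat16)
x = torch.randn(4, 8, dtype=torch.bfloat16)
layer(x).sum().backward()
print(layer.weight.grad.dtype)  # torch.bfloat16

# Hypothetical pattern for decoupling the two: accumulate each bf16
# gradient into a separate fp32 "main_grad" buffer through a tensor hook.
layer.zero_grad(set_to_none=True)

def make_accumulate_hook(param):
    def hook(grad):
        # Up-cast and accumulate into the fp32 buffer; returning None
        # leaves the ordinary bf16 .grad untouched.
        param.main_grad.add_(grad.float())
    return hook

for p in layer.parameters():
    p.main_grad = torch.zeros_like(p, dtype=torch.float32)  # fp32 buffer
    p.register_hook(make_accumulate_hook(p))

layer(x).sum().backward()
print(layer.weight.dtype, layer.weight.main_grad.dtype)  # torch.bfloat16 torch.float32
```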

Could you please explain how it is possible to have bf16 parameters with fp32 gradients in the context of bf16 training? I am wondering why there is a difference between fp16 and bf16 training.
