Reminder
Reproduction
Using `examples/lora_multi_gpu/single_node.sh` with some params updated as shown above:

- with the `bf16` flag, the loss is 0 and `grad_norm` is NaN;
- with the `fp16` flag, the SFT training succeeds.

This behavior seems to have appeared after v0.6.0. I was using commit 2e592be, a quite early one, which works just fine.
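As background (my addition, not from the report above): bf16 keeps float32's full exponent range but has only 8 significand bits, while fp16 has 11 significand bits but a much narrower range, so the two modes fail numerically in different ways. A stdlib-only sketch of the two roundings (the helper names `to_bf16`/`to_fp16` are illustrative, not LLaMA-Factory APIs):

```python
import struct

def to_bf16(x: float) -> float:
    """Round a float to bfloat16 (keep the top 16 bits of the float32
    encoding, round-to-nearest-even), then widen back for inspection."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounded = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", rounded))[0]

def to_fp16(x: float) -> float:
    """Round a float to IEEE 754 half precision via struct's 'e' format."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# fp16 underflows below ~6e-8; bf16 keeps float32's exponent range:
print(to_fp16(1e-8))  # 0.0
print(to_bf16(1e-8))  # still nonzero

# bf16 is much coarser near 1.0 (steps of 2^-7 vs fp16's 2^-10),
# so small per-step changes can round away entirely:
print(to_bf16(1.001))  # rounds back to 1.0
print(to_fp16(1.001))  # stays distinct from 1.0
```

This is only a numerical illustration of why one precision flag can misbehave while the other works; it does not pinpoint the regression in the training code itself.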
Expected behavior
Both bf16 and fp16 should work.
System Info
Ubuntu 22.04 with an H800, torch 2.1.2, transformers 4.38.2
Others
similar issues: #3344 #3308