
loss=0 for lora sft Baichuan2-13B-Chat with bf16 #3353

Closed
conderls opened this issue Apr 19, 2024 · 0 comments
Labels
wontfix This will not be worked on

Comments

conderls commented Apr 19, 2024

Reminder

  • I have read the README and searched the existing issues.

Reproduction

[screenshot: modified training parameters]

Using examples/lora_multi_gpu/single_node.sh with some parameters updated as shown in the screenshot above:

  • with the --bf16 flag, the loss is 0 and grad_norm is nan
  • with the --fp16 flag, the SFT training succeeds

This behavior seems to have appeared after v0.6.0. I was previously using commit 2e592be, a fairly early one, which works just fine.
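To narrow down whether the zero loss comes from the bf16 forward pass itself (rather than from the LoRA or optimizer setup), a minimal standalone check like the sketch below can be run outside the training script. The model id, prompt, and loading options here are illustrative assumptions, not the exact configuration from the screenshot above.

```python
# Minimal bf16 sanity check, independent of LLaMA-Factory.
# Assumes the Hugging Face repo id "baichuan-inc/Baichuan2-13B-Chat"; adjust to your local path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "baichuan-inc/Baichuan2-13B-Chat"

# 1. Confirm the GPU actually supports bf16 (H800/Hopper should report True).
print("bf16 supported:", torch.cuda.is_bf16_supported())

# 2. Load the model in bf16 and run one forward pass with labels to inspect the loss.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, labels=inputs["input_ids"])

# If the loss is already 0 or the logits contain NaN here, the problem is in the
# bf16 forward pass of the base model, not in the LoRA/SFT training setup.
print("loss:", out.loss.item())
print("any NaN in logits:", torch.isnan(out.logits).any().item())
```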

Expected behavior

Both bf16 and fp16 should work.

System Info

Ubuntu 22.04 with H800 GPUs, torch 2.1.2, transformers 4.38.2

Others

similar issues: #3344 #3308

@hiyouga added the pending label (This problem is yet to be addressed.) on Apr 21, 2024
@hiyouga added the wontfix label (This will not be worked on) and removed the pending label on May 1, 2024
@hiyouga closed this as not planned on May 1, 2024