SFT: loss is inconsistent between zero2 and zero3 #3442

Open
1 task done
wsdmanonymous opened this issue Apr 25, 2024 · 4 comments
Labels
pending This problem is yet to be addressed.

Comments

@wsdmanonymous

Reminder

  • I have read the README and searched the existing issues.

Reproduction

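When training qwen1 with every hyperparameter held fixed except the DeepSpeed ZeRO stage (zero2 vs. zero3), the loss curves differ substantially. Has anyone run this kind of experiment before, and is this behavior expected?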
image
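
One way to double-check that the two runs really differ only in the ZeRO stage is to diff the two DeepSpeed configs programmatically. The sketch below is illustrative only: the batch size, accumulation steps, and clipping value are assumed placeholders, not the settings used in this issue, and `make_config` and `config_diff` are hypothetical helper names.

```python
import copy

# Assumed placeholder settings; replace with the values from the real runs.
BASE_CONFIG = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "gradient_clipping": 1.0,
    "bf16": {"enabled": True},
}


def make_config(stage):
    """Return a copy of the base config with the given ZeRO stage."""
    cfg = copy.deepcopy(BASE_CONFIG)
    cfg["zero_optimization"] = {"stage": stage}
    return cfg


def flatten(d, prefix=""):
    """Flatten a nested dict into dotted keys, e.g. 'bf16.enabled'."""
    out = {}
    for key, value in d.items():
        full_key = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, full_key))
        else:
            out[full_key] = value
    return out


def config_diff(a, b):
    """Map each differing dotted key to its (value_in_a, value_in_b) pair."""
    fa, fb = flatten(a), flatten(b)
    keys = sorted(set(fa) | set(fb))
    return {k: (fa.get(k), fb.get(k)) for k in keys if fa.get(k) != fb.get(k)}


diff = config_diff(make_config(2), make_config(3))
print(diff)  # {'zero_optimization.stage': (2, 3)}
```

If the diff of the real configs shows anything beyond `zero_optimization.stage`, that extra difference is worth ruling out before attributing the loss gap to the ZeRO stage itself.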

Expected behavior

No response

System Info

No response

Others

No response

@hiyouga hiyouga added the pending This problem is yet to be addressed. label Apr 25, 2024
@Egber1t

Egber1t commented Apr 28, 2024

Hi, was this figure produced by LLaMA Factory's built-in plotting?

@xielinzhen

Hello! Is the communication cost of zero3 high? When I run SFT on llama3-8B for 20 steps, zero2 takes only 17 while zero3 takes 20 minutes. Is that abnormal?

@wsdmanonymous
Author

Hi, was this figure produced by LLaMA Factory's built-in plotting?

No, it wasn't.

@wsdmanonymous
Author

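Hello! Is the communication cost of zero3 high? When I run SFT on llama3-8B for 20 steps, zero2 takes only 17 while zero3 takes 20 minutes. Is that abnormal?

It is considerably higher, especially across multiple machines, where communication is essentially the performance bottleneck. The exact gap probably depends heavily on the hardware, though. On two A800s with IB networking enabled, the gap was not that large for me, roughly 4-5x.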
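
For a rough sense of scale, here is a back-of-the-envelope sketch of per-GPU communication volume per optimizer step. It uses the estimate from the ZeRO paper: about 2Ψ elements for stages 1 and 2 (the same as plain data parallelism) and about 3Ψ for stage 3, where Ψ is the parameter count. The 8e9 parameter count and 2 bytes per element (bf16) are assumptions, and `comm_volume_bytes` is a hypothetical helper. The estimate ignores latency, interconnect bandwidth, and gradient accumulation.

```python
def comm_volume_bytes(num_params, stage, bytes_per_elem=2):
    """Approximate bytes moved per GPU per step, per the ZeRO paper's analysis.

    Stages 1/2: reduce-scatter of gradients + all-gather of updated
    parameters, about 2 * num_params elements.
    Stage 3: adds parameter all-gathers around forward/backward,
    about 3 * num_params elements.
    """
    factor = 3 if stage == 3 else 2
    return factor * num_params * bytes_per_elem


NUM_PARAMS = 8_000_000_000  # assumed: roughly an 8B-parameter model

z2 = comm_volume_bytes(NUM_PARAMS, stage=2)
z3 = comm_volume_bytes(NUM_PARAMS, stage=3)
ratio = z3 / z2

print(f"zero2: {z2 / 1e9:.0f} GB, zero3: {z3 / 1e9:.0f} GB, ratio: {ratio:.1f}x")
# zero2: 32 GB, zero3: 48 GB, ratio: 1.5x
```

The theoretical volume ratio is only 1.5x, so a measured slowdown of several times, as reported above, most likely comes from factors this estimate leaves out, such as interconnect bandwidth and latency.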
