
Freezing layers has different behavior for different models #438

Open · 2 tasks
hjc3613 opened this issue Apr 18, 2024 · 2 comments

hjc3613 commented Apr 18, 2024

System Info

Dockerfile: (provided as a screenshot in the original issue)

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

When freezing the top n layers, the effect differs between models:
For Qwen, the frozen layers have no gradients, so GPU RAM usage drops compared to no freezing.
But for Qwen1.5 and wizard-8x22B, the frozen layers still seem to have gradients, so GPU RAM stays high; even when I freeze the top n-1 layers and fine-tune only the last layer, I get OOM when training on 3 nodes (24 × A800 in total).
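For reference, a minimal sketch of how top-layer freezing is commonly done in PyTorch, assuming a LLaMA/Qwen-style model whose decoder blocks live under `model.model.layers` and taking "top n" to mean the first n blocks. The checkpoint name, helper name, and layer count below are illustrative, not the exact llama-recipes implementation:

```python
import torch
from transformers import AutoModelForCausalLM

def freeze_layers(model, num_frozen: int):
    """Disable gradients for the first `num_frozen` decoder blocks."""
    for layer in model.model.layers[:num_frozen]:
        for param in layer.parameters():
            param.requires_grad = False

# Illustrative small checkpoint; any model exposing .model.layers works the same way.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-0.5B", torch_dtype=torch.bfloat16)
freeze_layers(model, num_frozen=12)

# Frozen parameters should contribute no gradient or optimizer-state memory.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable:,} / {total:,}")
```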

Error logs

CUDA out of memory

Expected behavior

Freezing the top layers should work for any model.

HamidShojanazeri (Contributor) commented

@hjc3613 Sorry for the inconvenience, this feature is not well tested, which is why we didn't mention it much. If you are interested, we would love to work with you on a PR to fix the issues.

hjc3613 (Author) commented Apr 19, 2024

@hjc3613 Sorry for the inconvenience, this feature is not well tested, which is why we didn't mention it much. If you are interested, we would love to work with you on a PR to fix the issues.

I think this feature is necessary: it can save a lot of GPU memory and, in some practical cases, give the same results as full-parameter training. I have tested this on my task: when freezing the top 40 layers (out of 80 in total), the results are the same as with no freezing, but using only 1 node (8 × 80 GB)!
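A quick single-process sanity check for whether freezing actually takes effect on a given model is to look at which layers have `.grad` populated after one backward pass. This is only a sketch with a small illustrative Qwen checkpoint (the real setup in this issue is much larger, and behavior under FSDP/multi-node training may differ):

```python
import torch
from transformers import AutoModelForCausalLM

# Small illustrative checkpoint, assuming a Qwen-style .model.layers layout.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-0.5B")
for layer in model.model.layers[:12]:           # freeze the first 12 decoder blocks
    for p in layer.parameters():
        p.requires_grad = False

ids = torch.randint(0, model.config.vocab_size, (1, 16))
model(input_ids=ids, labels=ids).loss.backward()

# If freezing works, frozen blocks should report no gradients here;
# if they do report gradients, gradient memory is still being spent on them.
for i, layer in enumerate(model.model.layers):
    has_grad = any(p.grad is not None for p in layer.parameters())
    print(f"layer {i:02d}: grad populated = {has_grad}")
```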
