cannot use pure_bf16 with zero3 cpu offload #3476

Open

mces89 opened this issue Apr 27, 2024 · 4 comments

mces89 commented Apr 27, 2024

Reminder

  • I have read the README and searched the existing issues.

Reproduction

I'm trying to do full SFT for Mixtral 8x22B using 2 x 8xA100 (80GB) instances. On the first try, I used pure_bf16 with ZeRO-3 but got GPU OOM. Then I switched to ZeRO-3 with CPU offload, but I get:
Traceback (most recent call last):
File "/opt/conda/envs/ptca/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/ptca/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/run.py", line 816, in
main()
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 347, in wrapper
return f(*args, **kwargs)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
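
For reference, a minimal sketch of the kind of ZeRO-3 CPU-offload DeepSpeed config described above might look like the following. The file name is hypothetical and the values are illustrative; the keys are standard DeepSpeed ZeRO-3 options.

```bash
# Write a ZeRO-3 config that offloads optimizer state and parameters to CPU.
# The file name and the "auto" placeholders are illustrative, not taken from the repo.
cat > ds_z3_cpu_offload.json <<'EOF'
{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
EOF
```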

Expected behavior

No response

System Info

No response

Others

No response

hiyouga (Owner) commented Apr 27, 2024

Could you try using the bf16 + pure_bf16 params?

mces89 (Author) commented Apr 27, 2024

@hiyouga Can you be more specific: can --pure_bf16 be used together with --bf16? And should I use CPU offload too?

hiyouga (Owner) commented Apr 27, 2024

yep, use both of the params
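
For concreteness, the combination being suggested might look roughly like the following on a single node. This is only a sketch: the entry-point script, model path, dataset, template, and hyperparameters below are placeholders, not values confirmed in this thread.

```bash
# Hypothetical sketch: pass --bf16 and --pure_bf16 together, pointing
# --deepspeed at a ZeRO-3 CPU-offload config. All paths are placeholders.
python -m torch.distributed.run --nproc_per_node 8 src/train.py \
    --stage sft \
    --do_train \
    --model_name_or_path mistralai/Mixtral-8x22B-v0.1 \
    --dataset alpaca_en \
    --template default \
    --finetuning_type full \
    --output_dir saves/mixtral-8x22b-sft \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --learning_rate 1e-5 \
    --num_train_epochs 1.0 \
    --bf16 true \
    --pure_bf16 true \
    --deepspeed ds_z3_cpu_offload.json
```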

mces89 (Author) commented Apr 27, 2024

Thanks. I used --pure_bf16 and --bf16 together with the ds3_cpu_offload DeepSpeed config, but I still get the same error. I'm using this command: https://github.com/hiyouga/LLaMA-Factory/blob/main/examples/full_multi_gpu/multi_node.sh
Is it because I use torch.distributed.run? How can I convert it to the deepspeed launcher?
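
For what it is worth, a rough sketch of switching from torch.distributed.run to the deepspeed launcher is shown below. The hostfile contents, node names, and entry-point script are assumptions; --hostfile and --num_gpus are standard deepspeed launcher options, and the training arguments would stay the same as in the torchrun-based script.

```bash
# Hypothetical hostfile listing the two 8-GPU nodes (names are placeholders;
# the deepspeed launcher reaches the workers over passwordless SSH).
cat > hostfile <<'EOF'
node-0 slots=8
node-1 slots=8
EOF

# Launch the same training script via the deepspeed launcher instead of
# python -m torch.distributed.run; the training flags are unchanged placeholders.
deepspeed --hostfile hostfile --num_gpus 8 src/train.py \
    --deepspeed ds_z3_cpu_offload.json \
    --stage sft \
    --do_train \
    --model_name_or_path mistralai/Mixtral-8x22B-v0.1 \
    --dataset alpaca_en \
    --template default \
    --finetuning_type full \
    --output_dir saves/mixtral-8x22b-sft \
    --bf16 true \
    --pure_bf16 true
```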

hiyouga added the pending label (This problem is yet to be addressed.) on Apr 27, 2024