cannot use pure_bf16 with zero3 cpu offload #3476

Open

mces89 opened this issue Apr 27, 2024 · 4 comments

mces89 commented Apr 27, 2024

Reminder

  • I have read the README and searched the existing issues.

Reproduction

I'm trying to do full SFT for Mixtral 8x22B using 2 x 8xA100 (80GB) instances. On the first try, I used pure_bf16 with ZeRO-3 but got GPU OOM. Then I switched to ZeRO-3 with CPU offload, but I get:
Traceback (most recent call last):
File "/opt/conda/envs/ptca/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/ptca/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/run.py", line 816, in
main()
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 347, in wrapper
return f(*args, **kwargs)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
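
For reference, a minimal sketch of the kind of ZeRO-3 CPU-offload DeepSpeed config described above might look like the following. The file name is hypothetical and the values are illustrative; the keys are standard DeepSpeed ZeRO-3 options.

```bash
# Write a ZeRO-3 config that offloads optimizer state and parameters to CPU.
# The file name and the "auto" placeholders are illustrative, not taken from the repo.
cat > ds_z3_cpu_offload.json <<'EOF'
{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
EOF
```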

Expected behavior

No response

System Info

No response

Others

No response

hiyouga (Owner) commented Apr 27, 2024

Could you try using the bf16 + pure_bf16 params?

mces89 (Author) commented Apr 27, 2024

@hiyouga Can you be more specific: can --pure_bf16 be used together with --bf16? And should I use CPU offload too?

hiyouga (Owner) commented Apr 27, 2024

yep, use both of the params
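
For concreteness, the combination being suggested might look roughly like the following on a single node. This is only a sketch: the entry-point script, model path, dataset, template, and hyperparameters below are placeholders, not values confirmed in this thread.

```bash
# Hypothetical sketch: pass --bf16 and --pure_bf16 together, pointing
# --deepspeed at a ZeRO-3 CPU-offload config. All paths are placeholders.
python -m torch.distributed.run --nproc_per_node 8 src/train.py \
    --stage sft \
    --do_train \
    --model_name_or_path mistralai/Mixtral-8x22B-v0.1 \
    --dataset alpaca_en \
    --template default \
    --finetuning_type full \
    --output_dir saves/mixtral-8x22b-sft \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --learning_rate 1e-5 \
    --num_train_epochs 1.0 \
    --bf16 true \
    --pure_bf16 true \
    --deepspeed ds_z3_cpu_offload.json
```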

mces89 (Author) commented Apr 27, 2024

Thanks. I used --pure_bf16 and --bf16 together with the ds3_cpu_offload DeepSpeed config, but I still get the same error. I'm using this command: https://github.com/hiyouga/LLaMA-Factory/blob/main/examples/full_multi_gpu/multi_node.sh
Is it because I use torch.distributed.run? How can I convert it to the deepspeed launcher?
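
For what it is worth, a rough sketch of switching from torch.distributed.run to the deepspeed launcher is shown below. The hostfile contents, node names, and entry-point script are assumptions; --hostfile and --num_gpus are standard deepspeed launcher options, and the training arguments would stay the same as in the torchrun-based script.

```bash
# Hypothetical hostfile listing the two 8-GPU nodes (names are placeholders;
# the deepspeed launcher reaches the workers over passwordless SSH).
cat > hostfile <<'EOF'
node-0 slots=8
node-1 slots=8
EOF

# Launch the same training script via the deepspeed launcher instead of
# python -m torch.distributed.run; the training flags are unchanged placeholders.
deepspeed --hostfile hostfile --num_gpus 8 src/train.py \
    --deepspeed ds_z3_cpu_offload.json \
    --stage sft \
    --do_train \
    --model_name_or_path mistralai/Mixtral-8x22B-v0.1 \
    --dataset alpaca_en \
    --template default \
    --finetuning_type full \
    --output_dir saves/mixtral-8x22b-sft \
    --bf16 true \
    --pure_bf16 true
```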

hiyouga added the pending label (This problem is yet to be addressed.) on Apr 27, 2024