You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I have read the README and searched the existing issues.
Reproduction
I'm trying to do full sft for mixtral 8x22B, I used 2 8xa100(80g) instances. For the first try, i use pure_bf16 with zero3 but i get GPU OOM. Then i switched to zero3 with cpu offload, but i get:
Traceback (most recent call last):
File "/opt/conda/envs/ptca/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/ptca/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/run.py", line 816, in
main()
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 347, in wrapper
return f(*args, **kwargs)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
Expected behavior
No response
System Info
No response
Others
No response
The text was updated successfully, but these errors were encountered:
Reminder
Reproduction
I'm trying to do full sft for mixtral 8x22B, I used 2 8xa100(80g) instances. For the first try, i use pure_bf16 with zero3 but i get GPU OOM. Then i switched to zero3 with cpu offload, but i get:
Traceback (most recent call last):
File "/opt/conda/envs/ptca/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/ptca/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/run.py", line 816, in
main()
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 347, in wrapper
return f(*args, **kwargs)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
Expected behavior
No response
System Info
No response
Others
No response
The text was updated successfully, but these errors were encountered: