line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/root/miniconda3/envs/llama3/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 62, in <module>
import deepspeed
File "/root/miniconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/__init__.py", line 25, in <module>
from . import ops
File "/root/miniconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/ops/__init__.py", line 15, in <module>
from ..git_version_info import compatible_ops as __compatible_ops__
File "/root/miniconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/git_version_info.py", line 29, in <module>
op_compatible = builder.is_compatible()
File "/root/miniconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/ops/op_builder/fp_quantizer.py", line 29, in is_compatible
sys_cuda_major, _ = installed_cuda_version()
File "/root/miniconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 50, in installed_cuda_version
raise MissingCUDAException("CUDA_HOME does not exist, unable to compile CUDA op(s)")
deepspeed.ops.op_builder.builder.MissingCUDAException: CUDA_HOME does not exist, unable to compile CUDA op(s)
Expected behavior
This script works fine on a single node, but fails with the above error when nnode>=2.
The same script works fine with other models such as llama2 for nnode>=1.
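Since the traceback ends in `MissingCUDAException: CUDA_HOME does not exist`, one way to narrow this down might be to run a small check on every node, in the same environment the training processes are launched in, and compare the output between the master and the child nodes. The sketch below is only a diagnostic under assumptions: the helper name `cuda_env_report` is made up for illustration, and it only inspects the `CUDA_HOME`/`CUDA_PATH` variables and whether `nvcc` is on `PATH`, which may not cover every way the toolkit is located.

```python
import os
import shutil
import socket


def cuda_env_report(environ=None):
    """Collect CUDA-related environment details for the current host.

    `environ` defaults to os.environ; passing a dict is only for testing.
    This is a hypothetical helper, not part of deepspeed or trl.
    """
    env = os.environ if environ is None else environ
    # CUDA_HOME is the variable named in the error; CUDA_PATH is a common alias.
    cuda_home = env.get("CUDA_HOME") or env.get("CUDA_PATH")
    # Look for the nvcc compiler on the PATH seen by this process.
    nvcc = shutil.which("nvcc", path=env.get("PATH"))
    return {
        "host": socket.gethostname(),
        "CUDA_HOME": cuda_home,
        "cuda_home_exists": bool(cuda_home) and os.path.isdir(cuda_home),
        "nvcc_on_path": nvcc,
    }


if __name__ == "__main__":
    for key, value in cuda_env_report().items():
        print(f"{key}: {value}")
```

If the child nodes report `CUDA_HOME: None` while the master node does not, the environment is probably not being propagated to the remote processes (for example, because a non-interactive shell skips the profile that exports it). That would point to a launch/environment problem rather than something specific to the model.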
Reminder
Reproduction
When executing the following with Meta-Llama-3-8B, it seems that deepspeed cannot be imported correctly on the child nodes. Error message snapshot:
System Info
transformers version: 4.40.0
Name: deepspeed
Version: 0.14.1
Others
No response