
Issues of LLaMA3 SFT on multi-nodes #3381

Open · 1 task done
Liusifei opened this issue Apr 22, 2024 · 0 comments

Labels: pending (This problem is yet to be addressed.)


Reminder

  • I have read the README and searched the existing issues.

Reproduction

When executing the following with Meta-Llama-3-8B, it seems that deepspeed cannot be imported correctly on the child nodes.

MASTER_PORT=25001
NPROC_PER_NODE=$1
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr

echo "Configuration for distributed training:"
echo "MASTER_ADDR: $MASTER_ADDR"
echo "MASTER_PORT: $MASTER_PORT"
echo "NPROC_PER_NODE: $NPROC_PER_NODE"
echo "SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST"
echo "Number of nodes: $SLURM_JOB_NUM_NODES"
echo "Node rank: $SLURM_PROCID"

python -m torch.distributed.run \
    --nproc_per_node $NPROC_PER_NODE \
    --nnodes $SLURM_JOB_NUM_NODES \
    --node_rank $SLURM_PROCID \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    src/train_bash.py \
    --deepspeed deepspeed3.json \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Meta-Llama-3-8B \
    --dataset Temp_ST1_sub \
    --template default \
    --streaming \
    --finetuning_type full \
    --output_dir saves/Temp111_ST1_lm38b_mn \
    --overwrite_cache \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 200 \
    --save_steps 400 \
    --learning_rate 2.0e-5 \
    --num_train_epochs 4.0 \
    --max_steps 60000 \
    --ddp_timeout 1800000 \
    --plot_loss \
    --bf16 \
    --dispatch_batches False \
    --ignore_data_skip

Error message snapshot:

line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/root/miniconda3/envs/llama3/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 62, in <module>
    import deepspeed
  File "/root/miniconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/__init__.py", line 25, in <module>
    from . import ops
  File "/root/miniconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/ops/__init__.py", line 15, in <module>
    from ..git_version_info import compatible_ops as __compatible_ops__
  File "/root/miniconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/git_version_info.py", line 29, in <module>
    op_compatible = builder.is_compatible()
  File "/root/miniconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/ops/op_builder/fp_quantizer.py", line 29, in is_compatible
    sys_cuda_major, _ = installed_cuda_version()
  File "/root/miniconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 50, in installed_cuda_version
    raise MissingCUDAException("CUDA_HOME does not exist, unable to compile CUDA op(s)")
deepspeed.ops.op_builder.builder.MissingCUDAException: CUDA_HOME does not exist, unable to compile CUDA op(s)
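
For reference, the exception above is raised at import time by deepspeed's op builder when it cannot find a CUDA toolkit. A quick way to confirm whether CUDA_HOME and nvcc are actually visible on every allocated node (a hypothetical diagnostic, assuming the job runs under the same Slurm allocation as the script above) would be:

# Run one task per node and print the CUDA-related environment that
# deepspeed's op builder inspects when it is imported.
srun --ntasks-per-node=1 bash -c '
  echo "host: $(hostname)"
  echo "CUDA_HOME: ${CUDA_HOME:-<unset>}"
  which nvcc || echo "nvcc not found on PATH"
'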

Expected behavior

  1. This script works fine on a single node, but yields the above error when nnodes >= 2.
  2. The same script works fine with other models such as LLaMA-2, both on a single node and on multiple nodes (nnodes >= 1).
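
If the child nodes turn out not to inherit the CUDA environment, a possible workaround (a sketch only; /usr/local/cuda is an assumed install path and should be replaced with the actual location on this cluster) is to export CUDA_HOME in the launch script before the torchrun command:

# Assumed CUDA toolkit location; adjust to the real path on the cluster.
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH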

System Info

  • transformers version: 4.40.0
  • Platform: Linux-5.15.0-1032-oracle-x86_64-with-glibc2.35
  • Python version: 3.10.14
  • Huggingface_hub version: 0.22.2
  • Safetensors version: 0.4.3
  • Accelerate version: 0.29.3
  • PyTorch version (GPU?): 2.2.2+cu121 (True)

  • DeepSpeed version: 0.14.1

Others

No response

@hiyouga added the pending label on Apr 23, 2024