torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1704987288773/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, internal error - please report this issue to the NCCL developers, NCCL version 2.19.3 ncclInternalError: Internal check failed. #3405

Open
1 task done
lostsollar opened this issue Apr 24, 2024 · 1 comment
Labels
pending This problem is yet to be addressed.

Comments

@lostsollar

Reminder

  • I have read the README and searched the existing issues.

Reproduction

deepspeed \
    --include localhost:0 \
    --master_port=9910 src/train_bash.py \
    --deepspeed ./ds_config.json \
    --stage sft \
    --model_name_or_path /disk1/models/Qwen/Qwen1.5-14B-Chat/ \
    --do_train \
    --dataset dag_sample_shuxue_v1 \
    --template qwen \
    --finetuning_type lora \
    --lora_rank 32 \
    --lora_target q_proj,k_proj,v_proj,o_proj \
    --output_dir /output_dir \
    --adapter_name_or_path /checkpoint-800 \
    --val_size 0.05 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 1 \
    --optim paged_adamw_32bit \
    --lr_scheduler_type cosine \
    --logging_steps 5 \
    --save_steps 100 \
    --eval_steps 100 \
    --warmup_steps 100 \
    --learning_rate 2e-5 \
    --max_steps 1500 \
    --max_grad_norm 0.5 \
    --num_train_epochs 2.0 \
    --seed 7321 \
    --overwrite_output_dir True \
    --quantization_bit 4 \
    --evaluation_strategy steps \
    --plot_loss \
    --fp16
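
An `ncclInternalError` on its own says little about the cause, so it may help to re-run the command above with verbose NCCL logging turned on. The following is only a sketch: the helper name `nccl_debug_env` is hypothetical, and the chosen subsystem list is an assumption. `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are NCCL environment variables, and `TORCH_DISTRIBUTED_DEBUG` is read by PyTorch.

```python
import os


def nccl_debug_env(base=None):
    """Return a copy of the environment with verbose NCCL logging enabled.

    `nccl_debug_env` is a hypothetical helper, not part of any library.
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"                 # print NCCL init and error details
    env["NCCL_DEBUG_SUBSYS"] = "INIT,COLL"     # assumed choice: init + collectives
    env["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra checks in torch.distributed
    return env


if __name__ == "__main__":
    env = nccl_debug_env()
    for key in ("NCCL_DEBUG", "NCCL_DEBUG_SUBSYS", "TORCH_DISTRIBUTED_DEBUG"):
        print(f"{key}={env[key]}")
```

The returned mapping can be passed as `env=` to `subprocess.run` when launching the `deepspeed` command, or the same three variables can be exported in the shell before running it.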


Environment:
pytorch 2.2.0
pytorch-cuda 12.1
nccl 2.19.3
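
For anyone comparing setups, the versions listed above can also be read from the running Python environment instead of the package manager. This is a rough sketch, assuming PyTorch is importable; `collect_versions` is a hypothetical helper name, and it falls back to placeholder strings when something is missing.

```python
import importlib


def collect_versions():
    """Report the PyTorch, CUDA and NCCL versions visible to this interpreter."""
    info = {"torch": "not installed", "cuda": "unknown", "nccl": "unknown"}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return info
    info["torch"] = str(torch.__version__)
    info["cuda"] = str(torch.version.cuda)  # None on CPU-only builds
    try:
        nccl = importlib.import_module("torch.cuda.nccl")
        version = nccl.version()
        # Recent PyTorch returns a tuple such as (2, 19, 3); older builds an int.
        if isinstance(version, tuple):
            info["nccl"] = ".".join(str(part) for part in version)
        else:
            info["nccl"] = str(version)
    except Exception:
        pass  # build without NCCL support; keep "unknown"
    return info


if __name__ == "__main__":
    for name, value in collect_versions().items():
        print(f"{name}: {value}")
```

Posting this output together with the GPU model and driver version would fill in the empty System Info section.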

Expected behavior

No response

System Info

No response

Others

No response

@hiyouga added the "pending" label (This problem is yet to be addressed.) on Apr 24, 2024
@belle9217

I ran into the same problem! How did you resolve it?
