
[Windows]: RuntimeError: Distributed package doesn't have NCCL built in #291

SkibaSAY commented Jul 11, 2023

Hi, I'm trying to run train.py on Windows. Please help me solve this problem.

System parameters

- CPU: 12th Gen Intel(R) Core(TM) i5-12600KF 3.70 GHz
- RAM: 32 GB
- CUDA 11.8
- Windows 11 Pro
- Python 3.10.11

Command:

```
torchrun --nproc_per_node=1 train.py ^
    --model_name_or_path "D:\torrents\LLaMA\models\Alpaca_7B.bin" ^
    --data_path "D:\torrents\LLaMA\train_data\alpaca_protocol_train_data.json" ^
    --bf16 True ^
    --output_dir "D:\torrents\LLaMA\models\trained" ^
    --num_train_epochs 3 ^
    --per_device_train_batch_size 4 ^
    --per_device_eval_batch_size 4 ^
    --gradient_accumulation_steps 8 ^
    --evaluation_strategy "no" ^
    --save_strategy "steps" ^
    --save_steps 2000 ^
    --save_total_limit 1 ^
    --learning_rate 2e-5 ^
    --weight_decay 0. ^
    --warmup_ratio 0.03 ^
    --lr_scheduler_type "cosine" ^
    --logging_steps 1 ^
    --fsdp "full_shard auto_wrap" ^
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' ^
    --tf32 True
```

Error 1:

```
NOTE: Redirects are currently not supported in Windows or MacOs.
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-IECM8DM]:29500 (system error: 10049 - ...).
```
Traceback 1

```
Traceback (most recent call last):
  File "D:\torrents\Stanford_Alpaca\stanford_alpaca\train.py", line 222, in <module>
    train()
  File "D:\torrents\Stanford_Alpaca\stanford_alpaca\train.py", line 184, in train
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\hf_argparser.py", line 346, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 113, in __init__
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\training_args.py", line 1340, in __post_init__
    and (self.device.type != "cuda")
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\training_args.py", line 1764, in device
    return self._setup_devices
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\utils\generic.py", line 54, in __get__
    cached = self.fget(obj)
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\training_args.py", line 1695, in _setup_devices
    self.distributed_state = PartialState(backend=self.ddp_backend)
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\accelerate\state.py", line 191, in __init__
    torch.distributed.init_process_group(backend=self.backend, **kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributed\distributed_c10d.py", line 907, in init_process_group
    default_pg = _new_process_group_helper(
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributed\distributed_c10d.py", line 1013, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
```

Error 2

```
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 920468) of binary: C:\Users\User\AppData\Local\Programs\Python\Python310\python.exe
```

Traceback 2

```
Traceback (most recent call last):
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\Scripts\torchrun.exe\__main__.py", line 7, in <module>
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributed\run.py", line 794, in main
    run(args)
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError
```

Similar error in another repository

I found a similar error in another repository: XavierXiao/Dreambooth-Stable-Diffusion#65
As far as I understand, this happens because NCCL is not supported on Windows: the PyTorch Windows builds ship without it, so distributed training there has to fall back to the gloo backend. A quick check is sketched below.
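
As a sanity check (a minimal sketch, not from the linked thread), the backends compiled into the local PyTorch build can be queried directly:

```python
# Check which distributed backends this PyTorch build actually ships.
# On Windows wheels NCCL is absent, so only gloo should report True.
import torch.distributed as dist

print("nccl available:", dist.is_nccl_available())  # expected: False on Windows
print("gloo available:", dist.is_gloo_available())  # expected: True
```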

As a solution, they suggest setting an environment variable:

```
PL_TORCH_DISTRIBUTED_BACKEND=gloo
```
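
On Windows cmd that means setting the variable before launching (a sketch of how I applied it; the remaining arguments are the same as in the command above):

```
set PL_TORCH_DISTRIBUTED_BACKEND=gloo
torchrun --nproc_per_node=1 train.py ...
```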

This solution did not work for me, but another one was proposed: set the variable in the code itself (imports added for completeness):

```python
import os
import sys

# Force the gloo backend on Windows, where NCCL is not available.
if sys.platform == "win32":
    os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"
```
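
One caveat, which is my assumption rather than something from the linked thread: PL_TORCH_DISTRIBUTED_BACKEND is a PyTorch Lightning variable, while train.py here runs through the Hugging Face Trainer. Traceback 1 shows the backend coming from PartialState(backend=self.ddp_backend), so passing the Trainer's own ddp_backend argument might be the more direct route (untested sketch; remaining arguments as in the command above):

```
torchrun --nproc_per_node=1 train.py --ddp_backend gloo ...
```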
