RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:520 (errno: 13 - Permission denied). The server socket has failed to bind to ?UNKNOWN? (errno: 13 - Permission denied). #140

Open
ysz2000 opened this issue Apr 2, 2024 · 0 comments


ysz2000 commented Apr 2, 2024

Traceback (most recent call last):
  File "/home/fangzhijun2/ChatGLM-Finetuning-master/train.py", line 234, in <module>
    main()
  File "/home/fangzhijun2/ChatGLM-Finetuning-master/train.py", line 79, in main
    deepspeed.init_distributed()
  File "/home/fangzhijun2/anaconda3/envs/torch/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 670, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
  File "/home/fangzhijun2/anaconda3/envs/torch/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 121, in __init__
    self.init_process_group(backend, timeout, init_method, rank, world_size)
  File "/home/fangzhijun2/anaconda3/envs/torch/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 149, in init_process_group
    torch.distributed.init_process_group(backend,
  File "/home/fangzhijun2/anaconda3/envs/torch/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 900, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/fangzhijun2/anaconda3/envs/torch/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 245, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/home/fangzhijun2/anaconda3/envs/torch/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 176, in _create_c10d_store
    return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:520 (errno: 13 - Permission denied). The server socket has failed to bind to ?UNKNOWN? (errno: 13 - Permission denied).
[2024-04-02 16:47:05,134] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 3061266
[2024-04-02 16:47:05,134] [ERROR] [launch.py:322:sigkill_handler] ['/home/fangzhijun2/anaconda3/envs/torch/bin/python', '-u', 'train.py', '--local_rank=0', '--train_path', 'data/spo_0.json', '--model_name_or_path', 'ChatGLM3-6B/', '--per_device_train_batch_size', '1', '--max_len', '1560', '--max_src_len', '1024', '--learning_rate', '1e-4', '--weight_decay', '0.1', '--num_train_epochs', '2', '--gradient_accumulation_steps', '4', '--warmup_ratio', '0.1', '--mode', 'glm3', '--lora_dim', '16', '--lora_alpha', '64', '--lora_dropout', '0.1', '--lora_module_name', 'query_key_value,dense_h_to_4h,dense_4h_to_h,dense', '--seed', '1234', '--ds_file', 'ds_zero2_no_offload.json', '--gradient_checkpointing', '--show_loss_step', '10', '--output_dir', './output-glm3'] exits with return code = 1
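The traceback ends in PyTorch's env:// rendezvous (`_env_rendezvous_handler` → `_create_c10d_store` → `TCPStore`), where rank 0 binds the master port. Port 520 is below 1024, and on Linux a process without root or `CAP_NET_BIND_SERVICE` normally cannot bind such ports, which is consistent with `errno: 13 - Permission denied`. Below is a minimal, hypothetical pre-flight check. It assumes the port comes from the `MASTER_PORT` environment variable (for example, set through the DeepSpeed launcher's `--master_port` option) and that the host uses the default Linux threshold of 1024, which is configurable via the `net.ipv4.ip_unprivileged_port_start` sysctl. The helper names and the suggested port 29500 are illustrative and are not part of DeepSpeed or PyTorch.

```python
import os
import socket
from typing import Mapping, Optional

# Assumption: default Linux value of net.ipv4.ip_unprivileged_port_start.
# Ports below it normally need root or CAP_NET_BIND_SERVICE to bind.
UNPRIVILEGED_PORT_START = 1024

# Assumption: 29500 is the usual default master port for torch.distributed
# launchers; any free port >= UNPRIVILEGED_PORT_START would do.
SUGGESTED_PORT = 29500


def is_privileged_port(port: int, start: int = UNPRIVILEGED_PORT_START) -> bool:
    """Return True if binding `port` normally requires elevated privileges."""
    if not 0 < port < 65536:
        raise ValueError(f"invalid TCP port: {port}")
    return port < start


def find_free_port(host: str = "127.0.0.1") -> int:
    """Let the OS pick a currently unused port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]


def master_port_warning(env: Mapping[str, str]) -> Optional[str]:
    """Hypothetical pre-flight check on MASTER_PORT; returns a warning or None."""
    raw = env.get("MASTER_PORT")
    if raw is None:
        return None  # MASTER_PORT not set, so there is nothing to check here
    port = int(raw)
    if is_privileged_port(port):
        return (
            f"MASTER_PORT={port} is below {UNPRIVILEGED_PORT_START}; a non-root "
            f"process will likely fail with errno 13. Try e.g. {SUGGESTED_PORT}."
        )
    return None


if __name__ == "__main__":
    print(is_privileged_port(520))    # True
    print(is_privileged_port(29500))  # False
    print(master_port_warning(os.environ))
```

If the check fires, relaunching with an unprivileged port (for example `--master_port 29500`) would likely avoid this particular bind error. Note that `find_free_port` only reports a port that was free at the moment of the call, so another process could still claim it before the launcher binds it.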
