OSError: [Errno 98] Address already in use #1113

Closed
chencjcj opened this issue May 10, 2024 · 3 comments
chencjcj commented May 10, 2024

dlrover version: v0.3.5
Megatron-LM version: main

I hit the following error when using flash checkpoint with Megatron-LM:

Exception in thread checkpoint-saver:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 422, in _saver
    saver: AsyncCheckpointSaver = class_def(**class_meta.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 386, in __init__
    self._event_queue = SharedQueue(name=qname, create=True)
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/common/multi_process.py", line 369, in __init__
    super().__init__(name, create)
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/common/multi_process.py", line 188, in __init__
    self._init_socket()
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/common/multi_process.py", line 210, in _init_socket
    self._server = _create_socket_server(self._socket_file)
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/common/multi_process.py", line 71, in _create_socket_server
    server.bind(path)
OSError: [Errno 98] Address already in use
Exception ignored in: <function AsyncCheckpointSaver.__del__ at 0x7efed6fbb490>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 402, in __del__
[2024-05-10 07:57:02,115] [INFO] [ckpt_saver.py:429:_factory] Start the checkpoint saver factory.
    self.close()
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 494, in close
    if not self._event_queue.empty():
AttributeError: 'MegatronCheckpointSaver' object has no attribute '_event_queue'
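
For context, both errors in the traceback above can be reproduced outside dlrover with plain Python. The sketch below is not dlrover's code: `bind_unix_socket`, `Saver`, and `demo` are hypothetical names used only for illustration. It assumes the socket in question is a Unix domain socket bound to a file path, which the `server.bind(path)` frame suggests. Binding to a path whose socket file already exists raises `EADDRINUSE` (Errno 98 on Linux), for example when an earlier process left the file behind or another process is still bound to it. Whether that is the actual cause here is not confirmed by this thread. The second error is a follow-on: `__init__` raised before `_event_queue` was assigned, so `close()` called from `__del__` found no such attribute.

```python
import errno
import os
import socket
import tempfile


def bind_unix_socket(path, remove_stale=False):
    """Bind a Unix domain socket server to `path`.

    With remove_stale=True an existing file at `path` is unlinked first.
    That is only safe when no live process is still serving on the path.
    """
    if remove_stale and os.path.exists(path):
        os.unlink(path)
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        server.bind(path)
    except OSError:
        server.close()
        raise
    server.listen(1)
    return server


class Saver:
    """Hypothetical stand-in for a class whose __init__ can fail part-way."""

    def __init__(self, path):
        self._server = bind_unix_socket(path)  # may raise OSError

    def close(self):
        # getattr guards against __init__ having failed before assignment,
        # which avoids the follow-on AttributeError during __del__.
        server = getattr(self, "_server", None)
        if server is not None:
            server.close()

    def __del__(self):
        self.close()


def demo():
    """Return True if rebinding to a leftover socket file gives EADDRINUSE."""
    code = None
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "queue.sock")
        first = bind_unix_socket(path)
        first.close()  # closing the socket does not remove the file
        try:
            bind_unix_socket(path)
        except OSError as exc:
            code = exc.errno
        # Unlinking the stale file first lets the bind succeed.
        second = bind_unix_socket(path, remove_stale=True)
        second.close()
    return code == errno.EADDRINUSE
```

Calling `demo()` exercises the failing bind and the bind after unlinking. Removing a socket file by hand is only appropriate once no training or agent process from a previous run is still using it.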

chencjcj reopened this May 10, 2024
workingloong (Collaborator) commented

Can you retry with dlrover[torch]==0.3.7? We have fixed several Megatron-LM bugs since 0.3.5, so this one may already be resolved.
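
Before retrying, it can help to confirm which dlrover release is actually installed in the training environment. The following is a minimal sketch using only the standard library. The helper names are hypothetical, and the comparison assumes plain dotted release strings such as `0.3.7` (pre-release or local suffixes are not handled).

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Turn a plain release string like '0.3.7' or 'v0.3.7' into (0, 3, 7)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def is_at_least(version, minimum):
    """Compare two plain release strings numerically, not lexically."""
    return version_tuple(version) >= version_tuple(minimum)


if __name__ == "__main__":
    current = installed_version("dlrover")
    if current is None:
        print("dlrover is not installed in this environment")
    else:
        print(current, "meets 0.3.7:", is_at_least(current, "0.3.7"))
```

Running this inside the same container or virtual environment as the training job shows whether the upgrade took effect there.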

chencjcj (Author) commented

Which version of Megatron-LM has been tested with dlrover? I get errors when I use the Megatron-LM main branch.

workingloong (Collaborator) commented

You can test with https://github.com/workingloong/Megatron-LM-CKPT, which was forked from Megatron-LM in February 2024.
