dlrover version: v0.3.5
megatron version: main

I encountered an error when using flash checkpoint in megatron:
Exception in thread checkpoint-saver:
Traceback (most recent call last):
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 422, in _saver
saver: AsyncCheckpointSaver = class_def(**class_meta.kwargs)
File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 386, in __init__
self._event_queue = SharedQueue(name=qname, create=True)
File "/usr/local/lib/python3.10/dist-packages/dlrover/python/common/multi_process.py", line 369, in __init__
super().__init__(name, create)
File "/usr/local/lib/python3.10/dist-packages/dlrover/python/common/multi_process.py", line 188, in __init__
self._init_socket()
File "/usr/local/lib/python3.10/dist-packages/dlrover/python/common/multi_process.py", line 210, in _init_socket
self._server = _create_socket_server(self._socket_file)
File "/usr/local/lib/python3.10/dist-packages/dlrover/python/common/multi_process.py", line 71, in _create_socket_server
server.bind(path)
OSError: [Errno 98] Address already in use
Exception ignored in: <function AsyncCheckpointSaver.__del__ at 0x7efed6fbb490>
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 402, in __del__
[2024-05-10 07:57:02,115] [INFO] [ckpt_saver.py:429:_factory] Start the checkpoint saver factory.
self.close()
File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 494, in close
if not self._event_queue.empty():
AttributeError: 'MegatronCheckpointSaver' object has no attribute '_event_queue'
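
For reference, both symptoms in the traceback can be reproduced outside dlrover. The sketch below is not dlrover code and not a proposed patch: `bind_unix_socket`, `GuardedSaver`, and `demo` are hypothetical names made up for illustration. It assumes the socket in `_create_socket_server` is a Unix domain socket bound to a file path, which the `_socket_file` argument suggests. Binding to a path that already exists fails with "Address already in use", and closing a socket does not remove its file, so a leftover file from an earlier run could plausibly cause this. The second error looks like a consequence of the first: `__init__` raised before `_event_queue` was assigned, so `close()` called from `__del__` hits a missing attribute.

```python
import errno
import os
import socket
import tempfile


def bind_unix_socket(path, unlink_stale=False):
    """Bind a Unix domain socket server at `path` (hypothetical helper)."""
    if unlink_stale and os.path.exists(path):
        # Only safe when no live process still owns this socket file.
        os.unlink(path)
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        server.bind(path)
    except OSError:
        server.close()
        raise
    server.listen(1)
    return server


class GuardedSaver:
    """Hypothetical saver whose cleanup tolerates a failed __init__."""

    def __init__(self, path):
        self._server = bind_unix_socket(path)  # may raise OSError

    def close(self):
        # The attribute is absent when __init__ raised before assigning it.
        server = getattr(self, "_server", None)
        if server is not None:
            server.close()

    def __del__(self):
        self.close()


def demo():
    path = os.path.join(tempfile.mkdtemp(), "demo.sock")
    first = bind_unix_socket(path)
    try:
        bind_unix_socket(path)  # same path again: bind() fails
        second_errno = None
    except OSError as exc:
        second_errno = exc.errno
    first.close()  # closing the socket leaves the file on disk
    still_exists = os.path.exists(path)
    rebound = bind_unix_socket(path, unlink_stale=True)
    rebound.close()
    os.unlink(path)
    return second_errno, still_exists
```

Calling `demo()` returns the errno of the second bind and whether the socket file outlived `first.close()`. Unlinking before binding is only one possible workaround and would break a saver that is still running, so whether it applies here depends on why the socket file already existed.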