[might be bug?] Failed to connect to bootstrap peers when using docker image on truenas scale #511

Open
TomLBZ opened this issue Sep 16, 2023 · 4 comments

TomLBZ commented Sep 16, 2023

2023-09-16 10:15:31.031074+00:00Sep 16 10:15:31.030 [INFO] Running Petals 2.2.0
2023-09-16 10:15:31.349212+00:00/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:1006: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
2023-09-16 10:15:31.349257+00:00warnings.warn(
2023-09-16 10:15:33.018599+00:00Downloading (…)lve/main/config.json:   0%|          | 0.00/609 [00:00<?, ?B/s]
Downloading (…)lve/main/config.json: 100%|██████████| 609/609 [00:00<00:00, 4.34MB/s]
2023-09-16 10:15:33.021225+00:00Sep 16 10:15:33.021 [INFO] Make sure you follow the LLaMA's terms of use: https://bit.ly/llama2-license for LLaMA 2, https://bit.ly/llama-license for LLaMA 1
2023-09-16 10:15:33.021283+00:00Sep 16 10:15:33.021 [INFO] Using DHT prefix: Llama-2-70b-hf
2023-09-16 10:15:33.021870+00:00/opt/conda/lib/python3.10/site-packages/transformers/configuration_utils.py:485: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
2023-09-16 10:15:33.021890+00:00warnings.warn(
2023-09-16 10:15:43.745860+00:00Traceback (most recent call last):
2023-09-16 10:15:43.745925+00:00File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
2023-09-16 10:15:43.746035+00:00return _run_code(code, main_globals, None,
2023-09-16 10:15:43.746086+00:00File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
2023-09-16 10:15:43.746102+00:00exec(code, run_globals)
2023-09-16 10:15:43.746113+00:00File "/home/petals/src/petals/cli/run_server.py", line 235, in <module>
2023-09-16 10:15:43.746186+00:00main()
2023-09-16 10:15:43.746204+00:00File "/home/petals/src/petals/cli/run_server.py", line 219, in main
2023-09-16 10:15:43.746299+00:00server = Server(
2023-09-16 10:15:43.746313+00:00File "/home/petals/src/petals/server/server.py", line 138, in __init__
2023-09-16 10:15:43.746400+00:00is_reachable = check_direct_reachability(initial_peers=initial_peers, use_relay=False, **kwargs)
2023-09-16 10:15:43.746416+00:00File "/home/petals/src/petals/server/reachability.py", line 78, in check_direct_reachability
2023-09-16 10:15:43.746454+00:00return RemoteExpertWorker.run_coroutine(_check_direct_reachability())
2023-09-16 10:15:43.746473+00:00File "/opt/conda/lib/python3.10/site-packages/hivemind/moe/client/remote_expert_worker.py", line 36, in run_coroutine
2023-09-16 10:15:43.751352+00:00return future if return_future else future.result()
2023-09-16 10:15:43.751381+00:00File "/opt/conda/lib/python3.10/concurrent/futures/_base.py", line 458, in result
2023-09-16 10:15:43.751782+00:00return self.__get_result()
2023-09-16 10:15:43.751811+00:00File "/opt/conda/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
2023-09-16 10:15:43.751840+00:00raise self._exception
2023-09-16 10:15:43.751873+00:00File "/home/petals/src/petals/server/reachability.py", line 59, in _check_direct_reachability
2023-09-16 10:15:43.751897+00:00target_dht = await DHTNode.create(client_mode=True, **kwargs)
2023-09-16 10:15:43.751907+00:00File "/opt/conda/lib/python3.10/site-packages/hivemind/dht/node.py", line 192, in create
2023-09-16 10:15:43.752325+00:00p2p = await P2P.create(**kwargs)
2023-09-16 10:15:43.752358+00:00File "/opt/conda/lib/python3.10/site-packages/hivemind/p2p/p2p_daemon.py", line 234, in create
2023-09-16 10:15:43.752725+00:00await asyncio.wait_for(ready, startup_timeout)
2023-09-16 10:15:43.752748+00:00File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
2023-09-16 10:15:43.753069+00:00return fut.result()
2023-09-16 10:15:43.753083+00:00hivemind.p2p.p2p_daemon_bindings.utils.P2PDaemonError: Daemon failed to start: 2023/09/16 10:15:43 failed to connect to bootstrap peers

I tried to run the Docker container on TrueNAS SCALE, but it failed with the error above. Could this be a bug?

@redcap3000

I'm having the same problem on Linux when attempting to connect to a private swarm.
File "/home/rcap3/anaconda3/lib/python3.11/site-packages/hivemind/dht/node.py", line 192, in create
Sep 17 17:55:02 i7ubuntu python[322820]: p2p = await P2P.create(**kwargs)
Sep 17 17:55:02 i7ubuntu python[322820]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
Sep 17 17:55:02 i7ubuntu python[322820]: File "/home/rcap3/anaconda3/lib/python3.11/site-packages/hivemind/p2p/p2p_daemon.py", line 234, in create
Sep 17 17:55:02 i7ubuntu python[322820]: await asyncio.wait_for(ready, startup_timeout)
Sep 17 17:55:02 i7ubuntu python[322820]: File "/home/rcap3/anaconda3/lib/python3.11/asyncio/tasks.py", line 479, in wait_for
Sep 17 17:55:02 i7ubuntu python[322820]: return fut.result()
Sep 17 17:55:02 i7ubuntu python[322820]: ^^^^^^^^^^^^
Sep 17 17:55:02 i7ubuntu python[322820]: hivemind.p2p.p2p_daemon_bindings.utils.P2PDaemonError: Daemon failed to start: 2023/09/17 17:55:02 failed to connect to bootstrap peers

@edugamerplay1228

2023-09-19 16:43:06.190015: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Sep 19 16:43:07.354 [INFO] Running Petals 2.2.0
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py:1006: FutureWarning: The use_auth_token argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
Downloading (…)lve/main/config.json: 100% 610/610 [00:00<00:00, 2.74MB/s]
Sep 19 16:43:07.883 [INFO] Make sure you follow the LLaMA's terms of use: https://bit.ly/llama2-license for LLaMA 2, https://bit.ly/llama-license for LLaMA 1
Sep 19 16:43:07.884 [INFO] Using DHT prefix: Llama-2-13b-hf
/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py:485: FutureWarning: The use_auth_token argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/petals/cli/run_server.py", line 235, in <module>
main()
File "/usr/local/lib/python3.10/dist-packages/petals/cli/run_server.py", line 219, in main
server = Server(
File "/usr/local/lib/python3.10/dist-packages/petals/server/server.py", line 138, in __init__
is_reachable = check_direct_reachability(initial_peers=initial_peers, use_relay=False, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/petals/server/reachability.py", line 78, in check_direct_reachability
return RemoteExpertWorker.run_coroutine(_check_direct_reachability())
File "/usr/local/lib/python3.10/dist-packages/hivemind/moe/client/remote_expert_worker.py", line 36, in run_coroutine
return future if return_future else future.result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
return self.__get_result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/usr/local/lib/python3.10/dist-packages/petals/server/reachability.py", line 59, in _check_direct_reachability
target_dht = await DHTNode.create(client_mode=True, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/hivemind/dht/node.py", line 192, in create
p2p = await P2P.create(**kwargs)
File "/usr/local/lib/python3.10/dist-packages/hivemind/p2p/p2p_daemon.py", line 234, in create
await asyncio.wait_for(ready, startup_timeout)
File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
return fut.result()
hivemind.p2p.p2p_daemon_bindings.utils.P2PDaemonError: Daemon failed to start: 2023/09/19 16:43:13 failed to connect to bootstrap peers

@borzunov
Collaborator

borzunov commented Sep 19, 2023

Hi @TomLBZ @redcap3000 @edugamerplay1228,

This may be caused by the DNS/IPv6 addresses among the default bootstrap peers. Could you please try again with the option below, which uses IPv4 addresses only?

--initial_peers /ip4/159.89.214.152/tcp/31337/p2p/QmedTaZXmULqwspJXz44SsPZyTNKxhnnFvYRajfH7MGhCY /ip4/159.203.156.48/tcp/31338/p2p/QmQGTqmM7NKjV6ggU1ZCap8zWiyKR89RViDXiqehSiCpY5
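
For anyone wondering what "IPv4 addresses only" means in practice, here is a minimal Python sketch that filters a list of bootstrap multiaddrs down to the `/ip4/` ones. The two `/ip4/` entries are the ones from the option above; the `/dns/` and `/ip6/` entries and the peer ID `QmExamplePeerId` are made-up placeholders, and `parse_multiaddr` and `ipv4_only` are hypothetical helpers, not part of Petals or hivemind. It only handles the simple `/<proto>/<host>/tcp/<port>/p2p/<peer_id>` shape.

```python
from typing import List, Optional, Tuple


def parse_multiaddr(maddr: str) -> Optional[Tuple[str, str, int, str]]:
    """Split '/ip4/1.2.3.4/tcp/31337/p2p/<peer_id>' into (proto, host, port, peer_id).

    Returns None for anything that does not match this simple shape.
    """
    parts = maddr.strip().split("/")
    # The leading slash yields an empty first element:
    # ['', proto, host, 'tcp', port, 'p2p', peer_id]
    if len(parts) != 7 or parts[0] != "" or parts[3] != "tcp" or parts[5] != "p2p":
        return None
    if not parts[4].isdigit():
        return None
    return parts[1], parts[2], int(parts[4]), parts[6]


def ipv4_only(maddrs: List[str]) -> List[str]:
    """Keep only the multiaddrs whose first component is 'ip4'."""
    kept = []
    for maddr in maddrs:
        parsed = parse_multiaddr(maddr)
        if parsed is not None and parsed[0] == "ip4":
            kept.append(maddr)
    return kept


CANDIDATES = [
    # Taken from the --initial_peers option above.
    "/ip4/159.89.214.152/tcp/31337/p2p/QmedTaZXmULqwspJXz44SsPZyTNKxhnnFvYRajfH7MGhCY",
    "/ip4/159.203.156.48/tcp/31338/p2p/QmQGTqmM7NKjV6ggU1ZCap8zWiyKR89RViDXiqehSiCpY5",
    # Placeholders (assumptions), only to show what gets filtered out.
    "/dns/bootstrap.example.org/tcp/31337/p2p/QmExamplePeerId",
    "/ip6/2001:db8::1/tcp/31337/p2p/QmExamplePeerId",
]

if __name__ == "__main__":
    # Print a value that can be pasted after --initial_peers.
    print(" ".join(ipv4_only(CANDIDATES)))
```

The output is a space-separated list in the same form as the `--initial_peers` value above.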

@hrQAQ
Contributor

hrQAQ commented Nov 14, 2023

Hello @borzunov,

I encountered a similar issue on Windows with WSL2 while attempting to connect to my own private swarm. I have two hosts connected on the same local area network, and the error log is identical to the ones above. Following your advice, I used this argument:

--initial_peers /ip4/159.89.214.152/tcp/31337/p2p/QmedTaZXmULqwspJXz44SsPZyTNKxhnnFvYRajfH7MGhCY /ip4/159.203.156.48/tcp/31338/p2p/QmQGTqmM7NKjV6ggU1ZCap8zWiyKR89RViDXiqehSiCpY5

With this argument, the connection succeeded. However, I see severe network throughput degradation when going through the bootstrap peers you provided. How can I solve the original problem directly, so that my private swarm does not depend on the public initial_peers?
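
One thing that may help narrow this down on a LAN is checking whether the second host can open a plain TCP connection to the private bootstrap peer at all. The sketch below uses only the Python standard library; `can_connect` is a hypothetical helper, and the address `192.168.1.10` and port `31337` are placeholders for whatever the private bootstrap peer actually listens on. A successful TCP connection only shows that the port is reachable (for example, not blocked by WSL2 or Docker networking); it does not prove that the libp2p handshake succeeds.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder LAN address and port of the private bootstrap peer (assumption).
    host, port = "192.168.1.10", 31337
    status = "reachable" if can_connect(host, port) else "NOT reachable"
    print(f"{host}:{port} is {status} over TCP")
```

Running this from inside the same environment as the server (the WSL2 shell or the container) matters, because that is where the network restrictions would apply.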
