
[BUG] Failed to load_state_from_peers at the first time because of "list index out of range" error #504

Open
alex-snd opened this issue Sep 5, 2022 · 2 comments
Labels
bug Something isn't working

Comments

@alex-snd
Contributor

alex-snd commented Sep 5, 2022

Describe the bug

A new peer cannot synchronize its state with other peers on the first attempt because of a "list index out of range" error. At best, the new peer succeeds only on the second attempt; at worst, it cannot synchronize its state at all.

Failed to load state from peers: list index out of range, retrying ...
Traceback (most recent call last):
  File "/home/TRecover/venv/lib/python3.8/site-packages/hivemind/optim/optimizer.py", line 694, in load_state_from_peers
    self.state_averager.load_state_from_peers(timeout=self.load_state_timeout, **kwargs)
  File "/home/TRecover/venv/lib/python3.8/site-packages/hivemind/optim/state_averager.py", line 667, in load_state_from_peers
    load_optimizer_state(self.optimizer, metadata["optimizer_metadata"], loaded_opt_tensors)
  File "/home/TRecover/venv/lib/python3.8/site-packages/hivemind/optim/state_averager.py", line 720, in load_optimizer_state
    flat_optimizer_state.append(flat_tensors[elem["index"]])
IndexError: list index out of range

I conducted an experiment to see how the new peer synchronizes its state with another peer (below we will call it the first peer).
Important clarification: the first peer and the new one have the same structure, i.e. the same number of tensors and the same metadata (which contains all non-tensor values).

def structure_shape(self) -> Tuple[int, int]:
    # Returns (number of optimizer metadata entries, number of state tensors).
    metadata, all_tensors, _ = self.hivemind_optimizer.state_averager.get_current_state()
    return len(metadata['optimizer_metadata']), len(all_tensors)

So new_peer.structure_shape() == first_peer.structure_shape() == (790, 637)

After the new peer requests the state from the first peer, the first peer dumps its state and yields the metadata plus 637 tensors in 5083 parts in this function. However, the new peer receives the metadata but only 583 tensors in 5029 parts in this loop, and then calls the load_optimizer_state function here with the downloaded state. Since the metadata describes a structure with 637 tensors, a list index out of range error occurs because only 583 tensors were received instead of 637. After the error, the new peer sends a request to the first peer to download its state again. This repeats until the new peer manages to receive all of the tensor parts.
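To make this failure mode concrete, here is a minimal self-contained sketch (the function and field names are hypothetical, not hivemind's actual API): the metadata records indices into the list of downloaded tensors, so a truncated download leaves the last indices dangling and the rebuild fails with IndexError.

```python
def rebuild_optimizer_state(optimizer_metadata, flat_tensors):
    """Rebuild a flat optimizer state; metadata entries reference tensor indices."""
    flat_optimizer_state = []
    for entry in optimizer_metadata:
        if entry["type"] == "tensor":
            # Raises IndexError when fewer tensors arrived than the metadata expects.
            flat_optimizer_state.append(flat_tensors[entry["index"]])
        else:
            flat_optimizer_state.append(entry["value"])
    return flat_optimizer_state


# Metadata describes 637 tensors, but only 583 were reassembled from the parts:
metadata = [{"type": "tensor", "index": i} for i in range(637)]
received = [object() for _ in range(583)]  # stand-ins for downloaded tensors

try:
    rebuild_optimizer_state(metadata, received)
except IndexError as err:
    print(err)  # -> list index out of range
```

With a complete download (637 tensors for 637 metadata entries) the same function succeeds, which matches the observation that retries eventually work.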

Thus, I realized that for some unknown reason the new peer does not receive all of the tensor parts from the first peer: it is this async loop that does not always return all the parts. I then found out that this error starts to occur after the "Update p2pd to v0.3.8 (and libp2p to v0.17.0)" commit; before that commit, everything works well.
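A defensive workaround on the receiving side would be to verify the tensor count against the metadata before applying the state, instead of letting the rebuild fail with IndexError partway through. A sketch under stated assumptions: request_state is a hypothetical callable standing in for the download loop, and the expected tensor count comes from something like the structure_shape helper above.

```python
def download_state_with_check(request_state, expected_num_tensors, max_retries=5):
    """Retry until a complete state (all tensor parts) has been received.

    `request_state` is a hypothetical callable returning (metadata, tensors);
    this mirrors the retry behaviour described above, but checks the tensor
    count explicitly rather than failing inside the state-loading code.
    """
    for attempt in range(1, max_retries + 1):
        metadata, tensors = request_state()
        if len(tensors) == expected_num_tensors:
            return metadata, tensors
        print(f"attempt {attempt}: received {len(tensors)}/{expected_num_tensors} tensors, retrying ...")
    raise RuntimeError(f"failed to download a complete state in {max_retries} attempts")
```

This does not fix the underlying transport issue, but it makes the incomplete-download case explicit and easier to log.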

To Reproduce

Prepare environment:

git clone -b hivemind_bag https://github.com/alex-snd/TRecover.git
cd TRecover
python3 -m venv venv
source venv/bin/activate
pip install git+https://github.com/learning-at-home/hivemind.git@de6b4f5ae835a633ca7876209f2929d069e988f0
pip install -e .[collab]
trecover init
trecover download data

Run the first peer:

trecover collab train --experiment-prefix bag --batch-size 1 --bandwidth 80

After a few seconds run the second (new) peer:

trecover collab train --initial-peers /COPY/ADDRESS/FROM/FIRST/PEER/CONSOLE/OUTPUT --experiment-prefix bag --batch-size 1 --bandwidth 80

Additionally, you can reinstall the library from the earlier commit 35851c8ce96f74b0221c4a732cc22be070f3185f and confirm that everything works with it:

pip uninstall hivemind -y
pip install git+https://github.com/learning-at-home/hivemind.git@35851c8ce96f74b0221c4a732cc22be070f3185f
# and repeat the experiment above.

Environment

  • Python version: 3.8.10 or above;
  • hivemind version: commit de6b4f5ae835a633ca7876209f2929d069e988f0;
  • Output from pytorch environment collection script:
    PyTorch version: 1.11.0+cu102
    Is debug build: False
    CUDA used to build PyTorch: 10.2
    ROCM used to build PyTorch: N/A
    OS: Ubuntu 20.04.4 LTS (x86_64)
    GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
    Clang version: Could not collect
    CMake version: Could not collect
    Libc version: glibc-2.31
    Python version: 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0] (64-bit runtime)
    Python platform: Linux-5.4.0-124-generic-x86_64-with-glibc2.29
    Is CUDA available: False
    CUDA runtime version: No CUDA
    GPU models and configuration: No CUDA
    Nvidia driver version: No CUDA
    cuDNN version: No CUDA
    HIP runtime version: N/A
    MIOpen runtime version: N/A
    Is XNNPACK available: True
    Versions of relevant libraries:
    [pip3] mypy-extensions==0.4.3
    [pip3] numpy==1.22.0
    [pip3] pytorch-lightning==1.6.4
    [pip3] torch==1.11.0
    [pip3] torchmetrics==0.9.2
    [conda] Could not collect
@alex-snd alex-snd added the bug Something isn't working label Sep 5, 2022
@alex-snd alex-snd changed the title [BUG] Failed to load_state_from_peers at the first time because of [list index out of range] error [BUG] Failed to load_state_from_peers at the first time because of "list index out of range" error Sep 6, 2022
@justheuristic
Member

Thanks for the detailed report! We're going to check whether the error goes away with older/newer versions of libp2p and report back what we find.

@alex-snd
Contributor Author

I suspect that this error is caused by the QUIC transport, which is always enabled, as stated here.

I set quic=True, and this error started to occur even in the commit that previously worked fine.
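One way to test this hypothesis is to launch the libp2p daemon without QUIC so that streams go over TCP only. The sketch below assembles hypothetical daemon arguments; the flag names (-quic, -hostAddrs) follow go-libp2p-daemon's CLI as I understand it and should be checked against your p2pd version — this is not hivemind's actual launch code.

```python
def p2pd_args(quic: bool, tcp_port: int = 0) -> list:
    """Build a p2pd command line; with quic=False only the TCP transport is used.

    Flag names are assumptions based on go-libp2p-daemon's CLI and must be
    verified against the p2pd binary actually installed.
    """
    args = ["p2pd", "-hostAddrs", f"/ip4/0.0.0.0/tcp/{tcp_port}"]
    if quic:
        args.append("-quic")  # enables the QUIC transport suspected above
    return args


print(p2pd_args(quic=False))
```

Comparing runs with and without the QUIC flag would show whether the truncated tensor streams correlate with the transport.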
