
[BUG] Failed to load_state_from_peers at the first time because of "list index out of range" error #504

Open
alex-snd opened this issue Sep 5, 2022 · 2 comments
Labels
bug Something isn't working

Comments

@alex-snd
Contributor

alex-snd commented Sep 5, 2022

Describe the bug

A new peer cannot synchronize its state with other peers on the first attempt because of a "list index out of range" error. At best, the new peer succeeds only on the second attempt; at worst, it cannot synchronize its state at all.

Failed to load state from peers: list index out of range, retrying ...
Traceback (most recent call last):
  File "/home/TRecover/venv/lib/python3.8/site-packages/hivemind/optim/optimizer.py", line 694, in load_state_from_peers
    self.state_averager.load_state_from_peers(timeout=self.load_state_timeout, **kwargs)
  File "/home/TRecover/venv/lib/python3.8/site-packages/hivemind/optim/state_averager.py", line 667, in load_state_from_peers
    load_optimizer_state(self.optimizer, metadata["optimizer_metadata"], loaded_opt_tensors)
  File "/home/TRecover/venv/lib/python3.8/site-packages/hivemind/optim/state_averager.py", line 720, in load_optimizer_state
    flat_optimizer_state.append(flat_tensors[elem["index"]])
IndexError: list index out of range

I conducted an experiment to see how the new peer synchronizes its state with another peer (below we will call it the first peer).
Important clarification: the first peer and the new one have the same structure, i.e. the same number of tensors and the same metadata (which contains all non-tensor values).

def structure_shape(self) -> Tuple[int, int]:
    # Returns (number of optimizer metadata entries, number of state tensors).
    metadata, all_tensors, _ = self.hivemind_optimizer.state_averager.get_current_state()
    return len(metadata['optimizer_metadata']), len(all_tensors)

So new_peer.structure_shape() == first_peer.structure_shape() == (790, 637)

After the new peer requests the state from the first peer, the first peer dumps its state and yields the metadata plus 637 tensors in 5083 parts in this function. However, the new peer receives the metadata but only 583 tensors in 5029 parts in this loop, and then calls the load_optimizer_state function here with the downloaded state. Since the metadata describes a structure with 637 tensors, a list index out of range error occurs because only 583 tensors were received instead of 637. After the error, the new peer sends a request to the first peer to download its state again. This repeats until the new peer manages to receive all of the tensor parts.
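To make this failure mode concrete, here is a minimal self-contained sketch (the function and field names are hypothetical, not hivemind's actual API): the metadata records indices into the list of downloaded tensors, so a truncated download leaves the last indices dangling and the rebuild fails with IndexError.

```python
def rebuild_optimizer_state(optimizer_metadata, flat_tensors):
    """Rebuild a flat optimizer state; metadata entries reference tensor indices."""
    flat_optimizer_state = []
    for entry in optimizer_metadata:
        if entry["type"] == "tensor":
            # Raises IndexError when fewer tensors arrived than the metadata expects.
            flat_optimizer_state.append(flat_tensors[entry["index"]])
        else:
            flat_optimizer_state.append(entry["value"])
    return flat_optimizer_state


# Metadata describes 637 tensors, but only 583 were reassembled from the parts:
metadata = [{"type": "tensor", "index": i} for i in range(637)]
received = [object() for _ in range(583)]  # stand-ins for downloaded tensors

try:
    rebuild_optimizer_state(metadata, received)
except IndexError as err:
    print(err)  # -> list index out of range
```

With a complete download (637 tensors for 637 metadata entries) the same function succeeds, which matches the observation that retries eventually work.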

Thus, I realized that for some unknown reason the new peer does not receive all of the tensor parts from the first peer: it is this async loop that does not always return all the parts. I then found out that this error starts to occur after the "Update p2pd to v0.3.8 (and libp2p to v0.17.0)" commit; before that commit, everything works well.
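A defensive workaround on the receiving side would be to verify the tensor count against the metadata before applying the state, instead of letting the rebuild fail with IndexError partway through. A sketch under stated assumptions: request_state is a hypothetical callable standing in for the download loop, and the expected tensor count comes from something like the structure_shape helper above.

```python
def download_state_with_check(request_state, expected_num_tensors, max_retries=5):
    """Retry until a complete state (all tensor parts) has been received.

    `request_state` is a hypothetical callable returning (metadata, tensors);
    this mirrors the retry behaviour described above, but checks the tensor
    count explicitly rather than failing inside the state-loading code.
    """
    for attempt in range(1, max_retries + 1):
        metadata, tensors = request_state()
        if len(tensors) == expected_num_tensors:
            return metadata, tensors
        print(f"attempt {attempt}: received {len(tensors)}/{expected_num_tensors} tensors, retrying ...")
    raise RuntimeError(f"failed to download a complete state in {max_retries} attempts")
```

This does not fix the underlying transport issue, but it makes the incomplete-download case explicit and easier to log.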

To Reproduce

Prepare environment:

git clone -b hivemind_bag https://github.com/alex-snd/TRecover.git
cd TRecover
python3 -m venv venv
source venv/bin/activate
pip install git+https://github.com/learning-at-home/hivemind.git@de6b4f5ae835a633ca7876209f2929d069e988f0
pip install -e .[collab]
trecover init
trecover download data

Run the first peer:

trecover collab train --experiment-prefix bag --batch-size 1 --bandwidth 80

After a few seconds run the second (new) peer:

trecover collab train --initial-peers /COPY/ADDRESS/FROM/FIRST/PEER/CONSOLE/OUTPUT --experiment-prefix bag --batch-size 1 --bandwidth 80

Additionally, you can reinstall the library from the earlier commit 35851c8ce96f74b0221c4a732cc22be070f3185f and confirm that everything works with it:

pip uninstall hivemind -y
pip install git+https://github.com/learning-at-home/hivemind.git@35851c8ce96f74b0221c4a732cc22be070f3185f
# and repeat the experiment above.

Environment

  • Python version: 3.8.10 or above;
  • hivemind version: commit de6b4f5ae835a633ca7876209f2929d069e988f0;
  • Output from pytorch environment collection script:
    PyTorch version: 1.11.0+cu102
    Is debug build: False
    CUDA used to build PyTorch: 10.2
    ROCM used to build PyTorch: N/A
    OS: Ubuntu 20.04.4 LTS (x86_64)
    GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
    Clang version: Could not collect
    CMake version: Could not collect
    Libc version: glibc-2.31
    Python version: 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0] (64-bit runtime)
    Python platform: Linux-5.4.0-124-generic-x86_64-with-glibc2.29
    Is CUDA available: False
    CUDA runtime version: No CUDA
    GPU models and configuration: No CUDA
    Nvidia driver version: No CUDA
    cuDNN version: No CUDA
    HIP runtime version: N/A
    MIOpen runtime version: N/A
    Is XNNPACK available: True
    Versions of relevant libraries:
    [pip3] mypy-extensions==0.4.3
    [pip3] numpy==1.22.0
    [pip3] pytorch-lightning==1.6.4
    [pip3] torch==1.11.0
    [pip3] torchmetrics==0.9.2
    [conda] Could not collect
@alex-snd alex-snd added the bug Something isn't working label Sep 5, 2022
@alex-snd alex-snd changed the title [BUG] Failed to load_state_from_peers at the first time because of [list index out of range] error [BUG] Failed to load_state_from_peers at the first time because of "list index out of range" error Sep 6, 2022
@justheuristic
Member

Thanks for the detailed report! We're going to check whether the error goes away with older/newer versions of libp2p and report back what we find.

@alex-snd
Contributor Author

I suspect that this error is caused by the QUIC transport, which is always enabled, as stated here.

I set quic=True, and this error started to occur even in the commit that previously worked fine.
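One way to test this hypothesis is to launch the libp2p daemon without QUIC so that streams go over TCP only. The sketch below assembles hypothetical daemon arguments; the flag names (-quic, -hostAddrs) follow go-libp2p-daemon's CLI as I understand it and should be checked against your p2pd version — this is not hivemind's actual launch code.

```python
def p2pd_args(quic: bool, tcp_port: int = 0) -> list:
    """Build a p2pd command line; with quic=False only the TCP transport is used.

    Flag names are assumptions based on go-libp2p-daemon's CLI and must be
    verified against the p2pd binary actually installed.
    """
    args = ["p2pd", "-hostAddrs", f"/ip4/0.0.0.0/tcp/{tcp_port}"]
    if quic:
        args.append("-quic")  # enables the QUIC transport suspected above
    return args


print(p2pd_args(quic=False))
```

Comparing runs with and without the QUIC flag would show whether the truncated tensor streams correlate with the transport.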
