forking before initialization of the MPFuture handler - server runtime not initialized in WSL --new_hive #581

Open
poedator opened this issue Aug 1, 2023 · 1 comment

poedator commented Aug 1, 2023

Apparently there is an issue with forking before initialization of the MPFuture handler. It happens when running a private hive under WSL. No such problem occurs when running on pure Linux, or under WSL with the real hive.

What likely happens (a rough sketch of this race is shown below the list):
--> The ConnectionHandlers were created as separate processes while the main process was still initializing its background thread (the MPFuture handler), so they forked a copy of the main process state in which the MPFuture handler was not fully initialized and was therefore broken.
--> The ConnectionHandlers could not complete self.ready.set_result() - that call is serviced by their (broken) MPFuture handler.
--> The ModuleContainer got stuck in run() -> handler.run_in_background() -> handler.wait_until_ready().
--> The ModuleContainer never reached Runtime.run(), which is called after handler.run_in_background().
--> The Runtime never started processing batches.
--> The connection handlers that received a request from the client passed the task to the Runtime but got no response, since the Runtime was never launched.
--> The client did not receive a response from the server.
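To make the race easier to see, here is a minimal, self-contained sketch in plain Python. It does not use Petals/hivemind internals; names like `_handler_loop`, `_init_handler`, and `connection_handler` are invented for illustration only. A child process forked before the parent's background "handler" thread finishes initializing inherits a broken copy of the handler state, so the parent blocks waiting for a readiness signal that never arrives, much like `wait_until_ready()` above:

```python
# Minimal sketch (NOT Petals/hivemind code) of the suspected race:
# a ConnectionHandler-like child process is forked while the parent's background
# "MPFuture handler" thread is still initializing, so the child inherits a
# half-initialized copy of that state and can never signal readiness.
import multiprocessing as mp
import threading
import time

_handler_queue = None              # set only after the handler thread is fully initialized
_server_ready = threading.Event()  # stands in for handler.wait_until_ready() in the parent


def _handler_loop(queue):
    """Parent-side background thread: delivers 'set_result'-style messages from children."""
    while True:
        msg = queue.get()
        if msg == "ready":
            _server_ready.set()


def _init_handler():
    """Slow initialization of the handler; the fork below happens inside this window."""
    global _handler_queue
    queue = mp.Queue()
    threading.Thread(target=_handler_loop, args=(queue,), daemon=True).start()
    time.sleep(1.0)          # simulate initialization still being in progress
    _handler_queue = queue   # only now is the handler usable


def connection_handler():
    """Forked child: tries to report readiness through the inherited handler state."""
    if _handler_queue is None:
        # Forked before initialization finished: the child's copy of the handler is
        # broken, so its self.ready.set_result() equivalent is silently lost.
        print("child: inherited handler state is broken, readiness never delivered")
        return
    _handler_queue.put("ready")


if __name__ == "__main__":
    mp.set_start_method("fork")  # the default on Linux/WSL; the race requires fork semantics
    threading.Thread(target=_init_handler, daemon=True).start()

    mp.Process(target=connection_handler).start()  # forks while _init_handler is still running

    # The parent blocks here, just like ModuleContainer waiting in handler.wait_until_ready():
    print("parent: became ready =", _server_ready.wait(timeout=3.0))  # prints False
```

In this sketch the parent prints `ready = False` after the timeout; if the child process is started only after `_init_handler` has finished (e.g. by joining the init thread first), the readiness message is delivered and the parent unblocks.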

How to reproduce:
run in WSL: HIVEMIND_LOGLEVEL=DEBUG python -m petals.cli.run_server bigscience/bloom-560m --new_swarm --identity tests/test.id --host_maddrs /ip4/127.0.0.1/tcp/32337 --throughput 1 --torch_dtype float32 --compression NONE --attn_cache_tokens 2048 --max_chunk_size_bytes 1024

Problem symptoms:
The server Runtime seems inactive:

  • no "Started" message in the server log
  • the server does not respond to client calls

Environment

PyTorch version: 2.0.1
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 531.79
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 8
Model name: AMD Ryzen 7 2700X Eight-Core Processor
Stepping: 2
CPU MHz: 3700.062
BogoMIPS: 7400.12
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 256 KiB
L1i cache: 512 KiB
L2 cache: 4 MiB
L3 cache: 16 MiB
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT vulnerable
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr virt_ssbd arat

Versions of relevant libraries:
[pip3] mypy==0.991
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.22.4
[pip3] torch==2.0.1
[pip3] triton==2.0.0
[conda] blas 2.16 mkl conda-forge
[conda] libblas 3.8.0 16_mkl conda-forge
[conda] libcblas 3.8.0 16_mkl conda-forge
[conda] liblapack 3.8.0 16_mkl conda-forge
[conda] liblapacke 3.8.0 16_mkl conda-forge
[conda] mkl 2020.2 256
[conda] numpy 1.25.1 pypi_0 pypi
[conda] pytorch 2.0.1 py3.10_cuda11.7_cudnn8.5.0_0 pytorch
[conda] pytorch-cuda 11.7 h778d358_5 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torchtriton 2.0.0 py310 pytorch

poedator added the bug label on Aug 1, 2023

poedator commented Aug 3, 2023

Also observed this issue on Linux (private swarm, bloom-560m).
