nvtop not detecting GPUs when used with Slurm #261

Open
angel-devicente opened this issue Jan 9, 2024 · 2 comments

@angel-devicente

Hi,
I'm facing a problem with nvtop in a Slurm interactive session. I get an interactive session on a machine with two GPUs. Slurm controls access to them, and in this particular case I'm requesting just one of them. I can verify that the GPU is usable for computation, and nvidia-smi detects it (it shows only one GPU, because that is all Slurm is giving me access to), but as you can see below, nvtop says there is no GPU to monitor. I have no idea what could be going on here. Any ideas on how to debug this issue and/or things I could try?
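
For reference, this is roughly how I request the session and check what Slurm has given me (the partition and GRES names are placeholders for my site's configuration; depending on the setup, Slurm may restrict the visible GPUs via CUDA_VISIBLE_DEVICES and/or cgroups):

$ srun --partition=gpu --gres=gpu:1 --pty bash   # interactive shell on a GPU node, one GPU requested
$ echo $CUDA_VISIBLE_DEVICES                     # only the allocated GPU index, e.g. "0"
$ nvidia-smi -L                                  # lists exactly the one allocated GPU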

Thanks,

$ nvidia-smi
Tue Jan  9 07:20:01 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.06              Driver Version: 545.23.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P100-PCIE-12GB           On  | 00000000:02:00.0 Off |                    0 |
| N/A   39C    P0              25W / 250W |      0MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

$ nvtop
No GPU to monitor.
@Syllo (Owner) commented Feb 23, 2024

Hello,
I don't know how Slurm allocates the GPUs. Could you check whether the library libnvidia-ml.so is available?
That is the library used to get the GPU information; nvidia-smi either queries the driver directly or is statically linked against this library, and hence works without it.
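
For example, something along these lines (the paths are just common locations, not necessarily the ones on your system):

$ ldconfig -p | grep libnvidia-ml                  # is NVML in the dynamic loader cache?
$ ls /usr/lib/x86_64-linux-gnu/libnvidia-ml.so*    # typical location on Debian/Ubuntu

If libnvidia-ml.so.1 cannot be found at runtime, nvtop has no way to query the NVIDIA driver and reports no GPUs.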

@angel-devicente (Author)

It turned out that the problem came from the installed (Snap) version; the AppImage version works without issues.
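
For anyone else hitting this: presumably the Snap's confinement kept nvtop from finding the host's libnvidia-ml.so. Switching to the AppImage was roughly just (the file name depends on the release you download):

$ chmod +x nvtop-x86_64.AppImage
$ ./nvtop-x86_64.AppImage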
