
nvtop doesn't display the correct GPU usage of processes on MIG devices #256

Open
Greenscreen23 opened this issue Dec 4, 2023 · 11 comments


@Greenscreen23

When I start a Python script using PyTorch and CUDA to compute anything (e.g., a matrix multiplication) on a GPU instance of a MIG-capable Nvidia GPU (in my case, an A100), nvtop shows that a process is using the GPU, but it reports 0% GPU usage.

System Specifications

  • nvtop Version: 3.0.2 (also tested with 3.0.0 (ppa), and 1.2.2 (apt))
  • Distro: Ubuntu 22.04
  • GPU: Nvidia A100, split into different configurations using MIG
  • Python 3.10
  • PyTorch 2.1.1

Steps to reproduce

  • Create the following Python script

    import torch
    
    device = torch.device('cuda')
    tensor1 = torch.rand((3_000, 3_000,)).to(device)
    tensor2 = torch.rand((3_000, 3_000,)).to(device)
    
    for _ in range(3_000):
        tensor3 = tensor1 @ tensor2
  • Find the UUID of your GPU instance. It should start with MIG-

    nvidia-smi -L
  • Run the script

    CUDA_VISIBLE_DEVICES=<UUID of the MIG device> python <name of your python script>
  • While the program is running, it is displayed in nvtop, but with 0% GPU usage

    [screenshot: nvtop showing the running process with 0% GPU usage]

  • The script takes about 9 seconds to finish on a 7g.40gb GPU instance (which is the whole GPU).
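For reference, a minimal timing sketch (not necessarily how the figure above was measured): since CUDA kernel launches are asynchronous, torch.cuda.synchronize() is needed before stopping the clock so the queued matmuls are actually included in the measurement.

    # Timing sketch: synchronize before reading the clock, because kernel
    # launches return immediately and only queue work on the GPU.
    import time
    import torch

    device = torch.device('cuda')
    tensor1 = torch.rand((3_000, 3_000)).to(device)
    tensor2 = torch.rand((3_000, 3_000)).to(device)

    start = time.perf_counter()
    for _ in range(3_000):
        tensor3 = tensor1 @ tensor2
    torch.cuda.synchronize()  # wait for all queued matmuls to finish
    print(f"elapsed: {time.perf_counter() - start:.1f} s")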

If you have any more questions regarding my setup or want me to test something, I'll be more than happy to answer or to try out different configurations. I don't have a different type of MIG-capable GPU or a different server to test on.

@alexmyczko

I also tried MIG on an A100, but it was buggy for us. What is the use case for you, and what is your MIG configuration? Multiple users? Multiple jobs from the same user? Something else?

@Greenscreen23
Author

We're simulating distributed machine learning in combination with containernet.

How would you describe your "buggy" experience with the A100? The only thing that felt buggy to me was disabling and enabling MIG mode, as it always thought there was a process still using the GPU. That was fixable by rebooting the machine. I usually just leave MIG enabled the entire time and apply a 7g.40gb partition whenever I need the entire GPU.

@alexmyczko

It has been a long time; let me check if I can remember/reconstruct it from history.

Starting at 2023-02-20T09:22:19+0100: nvidia-smi -i 0 -mig 1.
Some time (weeks) passed, and then the users wanted PyTorch and MinkowskiEngine. I think the problem was with PyTorch: it would simply fail to run. Back then we had CUDA 11.3 and CUDA 11.6, and we also tried 11.7 and 11.8. The torch version was 1.13.1.

Whatever it was, it worked again when MIG was disabled. (It's important to know that if you turn MIG on, it stays that way across reboots.)

@Greenscreen23
Author

Greenscreen23 commented Feb 13, 2024

Did you maybe try running jobs on the GPU itself rather than on MIG instances while MIG mode was enabled? That also didn't work for me. The fix is to create a 7g.40gb MIG instance and let the job run on that.

@alexmyczko

@Greenscreen23 that is very possibly what happened. Thanks, TIL.

@Syllo
Owner

Syllo commented Feb 23, 2024

Hello,
I think that I might have to update the Nvidia backend to support MIG device handles.

Although there is a notice in the documentation: "In MIG mode, if device handle is provided, the API returns aggregate information, only if the caller has appropriate privileges. Per-instance information can be queried by using specific MIG device handles. Querying per-instance information using MIG device handles is not supported if the device is in vGPU Host virtualization mode."
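For reference, here is a minimal pynvml sketch (not nvtop's actual C code) of what that documentation describes: when MIG mode is enabled, per-instance information comes from dedicated MIG device handles rather than from the physical GPU handle.

    # Sketch: enumerate MIG device handles under each physical GPU and query
    # per-instance information through them, as the NVML docs suggest.
    import pynvml

    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            gpu = pynvml.nvmlDeviceGetHandleByIndex(i)
            try:
                current_mode, _pending = pynvml.nvmlDeviceGetMigMode(gpu)
            except pynvml.NVMLError:
                current_mode = pynvml.NVML_DEVICE_MIG_DISABLE  # GPU has no MIG support
            if current_mode != pynvml.NVML_DEVICE_MIG_ENABLE:
                continue  # regular GPU: keep querying the physical handle
            # MIG mode: per-instance info requires the MIG device handles.
            for j in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
                try:
                    mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, j)
                except pynvml.NVMLError:
                    continue  # this MIG slot is not populated
                mem = pynvml.nvmlDeviceGetMemoryInfo(mig)
                print(pynvml.nvmlDeviceGetUUID(mig), mem.used, mem.total)
    finally:
        pynvml.nvmlShutdown()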

@Greenscreen23
Author

Hi,
Updating the backend sounds like a good idea. I'll have access to the machine next Monday and will be glad to test any fixes :)

However, the documentation seems to suggest that one has to query multiple handles to get the total load of the GPU. Maybe it's easier to handle a GPU in MIG mode as a set of MIG instances instead of aggregating the load over all MIG devices? For example, if an A100 has MIG mode disabled, it is displayed as normal; if MIG mode is enabled, the physical GPU is omitted from the visualization and each MIG device is treated like a separate GPU. Alternatively, we could show both the MIG devices and the GPU with our own aggregated values. I think there might be value in visualizing each MIG device separately.

I have not yet looked into the code of nvtop, so I don't know how easy or hard this would be to implement, but I'd be happy to help :). I also don't know how well the nvtop visualization scales with multiple GPUs. In my scenario, I might have 2x A100 split into 7 MIG devices each, resulting in 14 different devices.

@Syllo
Owner

Syllo commented Feb 23, 2024

I haven't looked at the details, but from what I understood, each MIG instance should show up as a separate handle (a device in nvtop).
Since I haven't updated the API usage in a while, my guess is that the handles right now are only the physical GPUs.

@Greenscreen23
Author

Great, let me know if I can help :)

@Syllo
Owner

Syllo commented Feb 25, 2024

All right. I have good news and bad news:

  • The good: I updated the code to use the latest NVML API functions to retrieve Nvidia GPU info.
  • The bad: it seems that retrieving per-process utilization is not supported in MIG mode (see the comment in the header). My last resort was to support accounting; however, it seems that accounting also cannot be enabled in MIG mode, according to this comment.

So I'm out of options to provide this info in MIG mode!
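A pynvml sketch illustrating that dead end (a probe, not nvtop's implementation): on MIG device handles, the per-process utilization query is expected to fail, typically with NOT_SUPPORTED, which is why there is no per-process GPU% to display.

    # Sketch: try the per-process utilization query on each MIG device handle.
    import pynvml

    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            gpu = pynvml.nvmlDeviceGetHandleByIndex(i)
            for j in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
                try:
                    mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, j)
                except pynvml.NVMLError:
                    continue  # empty MIG slot
                try:
                    samples = pynvml.nvmlDeviceGetProcessUtilization(mig, 0)
                    print(f"GPU {i} MIG {j}: {len(samples)} utilization samples")
                except pynvml.NVMLError as err:
                    # Expected to fail (e.g. NOT_SUPPORTED) on MIG handles.
                    print(f"GPU {i} MIG {j}: process utilization -> {err}")
    finally:
        pynvml.nvmlShutdown()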

@Greenscreen23
Author

No worries, thanks for trying!

Feel free to leave this issue open as a reminder, in case there is some Nvidia update down the road, or close it if you want to mark it as currently impossible :)
