
Long GPU idle times in loss forward pass #1954

Open
JulianKu opened this issue Feb 22, 2024 · 1 comment · May be fixed by #1958
Labels
bug Something isn't working

Comments

@JulianKu (Contributor)

I have just implemented an RL agent for a custom environment (wrapped into a TorchRL env). I am trying to reimplement the RAPS algorithm using SAC, and for that I use the SACLoss provided by TorchRL.
For structuring my code and setting everything up, I mainly followed examples/sac.
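For reference, here is a minimal sketch of roughly how the loss is wired up in my code (not my exact setup; the network sizes, keys, and device below are placeholders, and the actor/critic are simplified compared to my agent):

```python
import torch
from tensordict.nn import NormalParamExtractor, TensorDictModule
from torchrl.modules import MLP, ProbabilisticActor, TanhNormal, ValueOperator
from torchrl.objectives import SACLoss

# placeholder dimensions; my real environment has different shapes
obs_dim, act_dim = 8, 2
device = "cuda" if torch.cuda.is_available() else "cpu"

# actor: MLP -> (loc, scale) -> TanhNormal policy
actor_module = TensorDictModule(
    torch.nn.Sequential(
        MLP(in_features=obs_dim, out_features=2 * act_dim, num_cells=[256, 256]),
        NormalParamExtractor(),
    ),
    in_keys=["observation"],
    out_keys=["loc", "scale"],
)
actor = ProbabilisticActor(
    actor_module,
    in_keys=["loc", "scale"],
    distribution_class=TanhNormal,
    return_log_prob=True,
).to(device)

# Q-value network: (observation, action) -> state-action value
qvalue = ValueOperator(
    MLP(in_features=obs_dim + act_dim, out_features=1, num_cells=[256, 256]),
    in_keys=["observation", "action"],
).to(device)

loss_module = SACLoss(actor_network=actor, qvalue_network=qvalue)
# the call that shows up as the slow forward pass in the profile:
# loss_td = loss_module(sampled_tensordict)
```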

However, while training the agent, I experienced poor GPU utilization. Profiling showed that most of the time is spent in the SACLoss forward pass. I then used nsys profile to investigate this forward pass further. In the attached screenshot, I have recorded a single representative forward pass through the SACLoss (after some warmup iterations).
You can see that the GPU is only utilized for short periods at the start and end of the forward pass, plus a slightly longer period in the middle. Is this behavior expected?
I also notice that the CPU process running Python sits at 100%. I am not sure what is causing this, as all my networks are on the GPU and there shouldn't be much else running during the loss forward pass, right?

If all of this is not expected, how can I proceed to increase utilization (or first find out what is causing the low utilization)?
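In addition to the Nsight capture, here is roughly how I am trying to narrow down CPU vs. GPU time around the loss call with torch.profiler (a sketch only: it reuses the placeholder loss_module, obs_dim, act_dim, and device from the snippet above, and the random batch just stands in for a replay-buffer sample):

```python
import torch
from tensordict import TensorDict
from torch.profiler import ProfilerActivity, profile

batch_size = 256
# random placeholder batch with the keys SACLoss expects
batch = TensorDict(
    {
        "observation": torch.randn(batch_size, obs_dim),
        "action": torch.rand(batch_size, act_dim) * 2 - 1,
        "next": {
            "observation": torch.randn(batch_size, obs_dim),
            "reward": torch.randn(batch_size, 1),
            "done": torch.zeros(batch_size, 1, dtype=torch.bool),
            "terminated": torch.zeros(batch_size, 1, dtype=torch.bool),
        },
    },
    batch_size=[batch_size],
    device=device,
)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(5):
        loss_td = loss_module(batch)  # the forward pass in question
    if torch.cuda.is_available():
        torch.cuda.synchronize()

# compare CPU vs. CUDA time per op to see where the gaps come from
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```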

Screenshots

[Nvidia Nsight Systems screenshot of the SACLoss forward pass]

Environment:

  • GPU: RTX 3070 Mobile
  • Python inside a Conda environment
    • PyTorch 2.1.0
    • torchrl 0.3.0
JulianKu added the bug label on Feb 22, 2024
@vmoens (Contributor) commented Feb 23, 2024

I can have a look at that! Thanks for pointing it out; I'll keep you posted.

vmoens linked a pull request (#1958) on Feb 25, 2024 that will close this issue