
[Question] I do not understand the GPU and memory usage of SB3 #1630

Open
EloyAnguiano opened this issue Jul 27, 2023 · 13 comments
Labels
question Further information is requested

Comments

EloyAnguiano commented Jul 27, 2023

❓ Question

I think I do not understand the memory usage of SB3. I have a Dict observation space made of some huge matrices, so each observation is about 17 MB:

(Pdb) [sys.getsizeof(v) for k, v in obs.items()]
[2039312, 2968, 12235248, 105800, 2968, 2968, 2968, 2039312, 116, 2039312, 2968, 2968, 2968]
(Pdb) sum([sys.getsizeof(v) for k, v in obs.items()])/1024/1024
17.623783111572266

I am training a PPO agent on a vectorized environment created with the make_vec_env function with n_envs = 2, and the hyperparameters of my PPO agent are n_steps = 6 and batch_size = 16. If I understood correctly, my rollout buffer will hold n_steps x n_envs = 12 observations, so the rollout_buffer will take about 17 x 12 = 204 MB. I assume a batch_size of 16 gets capped at the buffer size, so it is equivalent to having a batch size of 12.
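
As a back-of-the-envelope check (a sketch only, assuming each observation stays at roughly 17.6 MB and is stored once per step):

# rough rollout-buffer RAM estimate (not SB3 code, just the arithmetic above)
obs_mb = 17.6        # measured size of one observation, in MB
n_steps = 6
n_envs = 2
buffer_mb = obs_mb * n_steps * n_envs   # observations dominate the buffer
print(f"~{buffer_mb:.0f} MB of observations in the rollout buffer")  # ~211 MB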

The problem is that when I use a GPU (an 80 GB A100), memory usage stabilizes at around 70 GB at the beginning, and a little later training stops because the device runs out of space. How is this even possible?


@EloyAnguiano EloyAnguiano added the question Further information is requested label Jul 27, 2023
araffin (Member) commented Jul 27, 2023

Hello,
an important piece of information is missing: your network architecture.
The rollout buffer stores things in RAM, not on the GPU.
Most of the GPU memory is taken up by weights and gradients.

Might be a duplicate of #863

EloyAnguiano (Author) commented:

Printing my model size with this:

def print_model_size(model):
    # bytes taken by the model parameters
    param_size = 0
    for param in model.parameters():
        param_size += param.nelement() * param.element_size()
    # bytes taken by registered buffers (e.g. running statistics)
    buffer_size = 0
    for buffer in model.buffers():
        buffer_size += buffer.nelement() * buffer.element_size()

    size_all_mb = (param_size + buffer_size) / 1024**2
    print(f'model size: {size_all_mb:.3f}MB')  # noqa: T201

And calling it like this:
print_model_size(agent.policy)

Returns:
model size: 8.369MB

Is there any part of the agent that could be bigger? I am using my custom FeatureExtractor class, but I assume it is included in agent.policy.
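
For what it's worth, the parameters are only part of the training footprint; here is a rough sketch of the fixed per-model GPU cost, assuming the default Adam optimizer (activations are ignored, and they depend on the architecture and batch size):

def estimate_training_memory_mb(model):
    # rough estimate: weights + one gradient per parameter + Adam's two moment buffers
    param_bytes = sum(p.nelement() * p.element_size() for p in model.parameters())
    return (param_bytes + param_bytes + 2 * param_bytes) / 1024**2

print(f"~{estimate_training_memory_mb(agent.policy):.1f} MB for weights, grads and Adam state")

Even at roughly 4x the 8.4 MB reported above (~33 MB), this is nowhere near 70 GB, which suggests the bulk comes from activations or data tensors rather than the weights.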

EloyAnguiano (Author) commented:

Also, whenever you run a batch of data on the GPU you have to transfer that data to the CUDA device, so the data is on the GPU at some point, isn't it?

@EloyAnguiano EloyAnguiano changed the title from "[Question] I do not underestanc the GPU and memory usage of SB3" to "[Question] I do not underestand the GPU and memory usage of SB3" on Jul 27, 2023
EloyAnguiano (Author) commented:

I am still unable to figure out the problem, either here or in issue #863. There, the solution was to flatten the observation, but that does not explain anything.

@EloyAnguiano EloyAnguiano changed the title from "[Question] I do not underestand the GPU and memory usage of SB3" to "[Question] I do not understand the GPU and memory usage of SB3" on Oct 19, 2023
EloyAnguiano (Author) commented Oct 23, 2023

@araffin I think the GPU usage could be more optimal. First of all, while debugging the PPO class (the train method) I found the GPU usage confusing: if I keep every hyperparameter fixed (n_steps, batch_size, etc.) but change the number of environments in the vectorized environment, the GPU usage differs:

1 environment: 1815 MiB
16 environments: 8551 MiB

I do not understand this, since self.rollout_buffer.size() is still n_steps, as before (32 in my case), so I do not know where the difference comes from. Indeed, the only things that should affect GPU memory usage are the size of the policy itself, the batch_size (this is key: the rollout_buffer should live in RAM, and whenever we want to train on a batch, that data is moved to the GPU), and the gradients of the model for backpropagation.

Does this make any sense? Am I missing something?

araffin (Member) commented Oct 23, 2023

This should answer your question:

self.observations = np.zeros((self.buffer_size, self.n_envs, *self.obs_shape), dtype=np.float32)
self.actions = np.zeros((self.buffer_size, self.n_envs, self.action_dim), dtype=np.float32)
self.rewards = np.zeros((self.buffer_size, self.n_envs), dtype=np.float32)
self.returns = np.zeros((self.buffer_size, self.n_envs), dtype=np.float32)
self.episode_starts = np.zeros((self.buffer_size, self.n_envs), dtype=np.float32)
self.values = np.zeros((self.buffer_size, self.n_envs), dtype=np.float32)
self.log_probs = np.zeros((self.buffer_size, self.n_envs), dtype=np.float32)
self.advantages = np.zeros((self.buffer_size, self.n_envs), dtype=np.float32)
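
In other words, every array above is allocated with shape (buffer_size, n_envs, ...), so the buffer grows linearly with the number of environments. A rough sketch of that scaling for the observation storage alone (assuming n_steps = 32 and the ~17.6 MB observation from above, stored as float32):

obs_mb = 17.6   # approximate per-step observation size
n_steps = 32
for n_envs in (1, 16):
    print(f"n_envs={n_envs:2d}: ~{obs_mb * n_steps * n_envs:,.0f} MB of observations")
# n_envs= 1: ~563 MB of observations
# n_envs=16: ~9,011 MB of observations

Note that these arrays are numpy, so this memory lives in host RAM, separately from whatever the GPU reports.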

EloyAnguiano (Author) commented:

Yes, it does. Thanks a lot. This also brings up my other question: I think the rollout buffer should not be on the GPU, and GPU usage should be controlled by the batch size at each epoch. That way, you could collect a giant rollout_buffer but still train on a small but fast GPU by choosing an appropriate batch_size. Isn't that so?

araffin (Member) commented Oct 24, 2023

#1720 (comment)

EloyAnguiano (Author) commented:

Sorry, I do not understand. If the rollout buffer is always on the CPU, why does the number of environments used increase the GPU usage at #1630 (comment)?

EloyAnguiano (Author) commented Oct 24, 2023

Indeed, if I debug PPO training on a GPU, I get this:

(Pdb) self.rollout_buffer.device
device(type='cuda', index=2)

This should mean that the data of the rollout_buffer is allocated on the GPU.

araffin (Member) commented Oct 24, 2023

why does the number of environments used increase the GPU usage?

Are you using subprocesses? If so, that might be due to the way Python multiprocessing works.

This should mean that the data of the rollout_buffer is allocated on the GPU

If you look at the code (and you should), the device is only used here:

def to_torch(self, array: np.ndarray, copy: bool = True) -> th.Tensor:
    """
    Convert a numpy array to a PyTorch tensor.
    Note: it copies the data by default
    :param array:
    :param copy: Whether to copy or not the data (may be useful to avoid changing things
        by reference). This argument is inoperative if the device is not the CPU.
    :return:
    """
    if copy:
        return th.tensor(array, device=self.device)
    return th.as_tensor(array, device=self.device)

when sampling the data there:

return RolloutBufferSamples(*tuple(map(self.to_torch, data)))
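
So the storage itself stays as numpy arrays in host RAM, and only the sampled minibatch is turned into a tensor on the target device. A minimal standalone sketch of that pattern (not SB3 code):

import numpy as np
import torch as th

device = th.device("cuda" if th.cuda.is_available() else "cpu")

# the "buffer" lives in host RAM as plain numpy
observations = np.zeros((32 * 16, 64), dtype=np.float32)

# only the sampled minibatch is copied to the device, one batch at a time
batch_inds = np.random.randint(0, observations.shape[0], size=16)
minibatch = th.as_tensor(observations[batch_inds], device=device)
print(minibatch.device, minibatch.shape)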

EloyAnguiano (Author) commented Oct 24, 2023

I am creating the environment like this:

gym_env = make_vec_env(make_env,
                       env_kwargs=env_kwargs,
                       n_envs=args.n_envs,
                       vec_env_cls=SubprocVecEnv)

So I assume it uses some kind of multiprocessing, yes. What does this have to do with GPU usage?
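
One thing worth checking: if each subprocess ever initializes CUDA (for example because the env itself imports and uses torch on the GPU), every worker creates its own CUDA context, which shows up as extra GPU memory per process in nvidia-smi. A sketch of one way to rule that out, assuming the envs themselves do not need the GPU (MyEnv and its import path are hypothetical placeholders):

import os

def make_env():
    # hide the GPU from this worker before anything can initialize CUDA,
    # so the subprocess cannot create its own CUDA context
    os.environ["CUDA_VISIBLE_DEVICES"] = ""
    from my_project.envs import MyEnv   # hypothetical import, replace with your env
    return MyEnv()

If nvidia-smi shows one python process per worker, each holding a few hundred MiB, that is usually the sign.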

EloyAnguiano (Author) commented:

Hi again @araffin. I am still unable to figure out how, if the transfer of data from the RolloutBuffer happens at each sampling, the GPU usage can be so large as soon as the code enters the train method, since at that point only the model should be on the GPU, not the data.
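
One way to narrow this down is to log what PyTorch itself has allocated right before and right after entering train(); that separates tensors that are really on the GPU from memory held by the CUDA context or by other processes. A small sketch using the standard torch.cuda counters:

import torch as th

def log_gpu_memory(tag: str, device: int = 0) -> None:
    # memory_allocated: bytes currently occupied by live tensors
    # memory_reserved: bytes held by PyTorch's caching allocator (>= allocated)
    alloc_mib = th.cuda.memory_allocated(device) / 1024**2
    reserved_mib = th.cuda.memory_reserved(device) / 1024**2
    print(f"[{tag}] allocated={alloc_mib:.0f} MiB, reserved={reserved_mib:.0f} MiB")

# e.g. call log_gpu_memory("before learn") before agent.learn() and again from a
# callback or a breakpoint inside PPO.train() to see what each phase adds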
