You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
curiosity_rewards has shape (num_steps, num_envs).
Then, when for loop over curiosity_rewards.cpu().data.numpy().T, we are looping over the num_envs, not num_steps.
So, we are calculating the intrinsic "return" w/ running exponential sum of rewards over environments.
We're supposed to calculate the intrinsic "return" w/ running exponential sum of rewards over timesteps.
It ends up not mattering too much, since we only need an approximate scale of the reward to normalize them, but is still wrong.
Hi Akarsh, it's indeed a bug. Can u provide a comparison plot of before and after the fix? (I don't see PR tho)
At the same time, can u also rename curiosity_reward_per_env to curiosity_reward_per_step
Problem Description
The CleanRL has the following lines of code:
curiosity_rewards
has shape(num_steps, num_envs)
.Then, when for loop over
curiosity_rewards.cpu().data.numpy().T
, we are looping over the num_envs, not num_steps.So, we are calculating the intrinsic "return" w/ running exponential sum of rewards over environments.
We're supposed to calculate the intrinsic "return" w/ running exponential sum of rewards over timesteps.
It ends up not mattering too much, since we only need an approximate scale of the reward to normalize them, but is still wrong.
In the original implementation of RND:
They do
rffs_int = np.array([self.I.rff_int.update(rew) for rew in self.I.buf_rews_int.T])
,but their
buf_rews_int
has shape(num_envs, num_steps)
, so it makes sense to do the transpose.In CleanRL, the shape is already reversed, so transposing makes it loop over environments.
Another issue is that in the original implementation, they update the RMS with all the rewards in the batch using
self.I.rff_rms_int.update(rffs_int.ravel())
.CleanRL updates the RMS using
count=len(curiosity_reward_per_env)
which is equal tonum_envs
.Possible Solution
I have fixed both of these issues in the following possible solution.
I also opened up a pull request with these fixes.
@yooceii @vwxyzjn
The text was updated successfully, but these errors were encountered: