
Discounted Reward Calculation (Generalized Advantage Estimation) #38

Open
artest08 opened this issue Jan 4, 2021 · 5 comments
Labels: help wanted (Extra attention is needed), question (Further information is requested)

Comments

artest08 commented Jan 4, 2021

I want to ask one more thing about the estimation of the discounted reward. The discounted_reward variable always starts at zero. However, if the episode has not ended, shouldn't it start from the value estimate given by the critic network?

In other words, I think the pseudocode for my suggestion is:

if is_terminal:
    discounted_reward = 0
else:
    discounted_reward = critic_network(final_state)

@gianlucadest

You are completely right. If the episode didn't end, you use the critic network (or the critic head of your two-headed actor-critic network) to approximate V(final_state).
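
For readers unfamiliar with the two-headed layout mentioned above, here is a minimal sketch of an actor-critic module whose critic head produces the V(s) used for bootstrapping. The class name, layer sizes, and usage comments are illustrative assumptions, not the repository's actual code.

import torch.nn as nn

class TwoHeadedActorCritic(nn.Module):
    """Illustrative two-headed actor-critic: a shared body with separate policy and value heads."""

    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.Tanh())
        self.actor_head = nn.Linear(hidden_dim, action_dim)  # policy logits
        self.critic_head = nn.Linear(hidden_dim, 1)          # state value V(s)

    def forward(self, state):
        features = self.body(state)
        return self.actor_head(features), self.critic_head(features)

# Bootstrapping an unfinished episode (variable names are illustrative):
#   _, value = model(final_state)
#   discounted_reward = value.item()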

nikhilbarhate99 (Owner) commented Apr 20, 2021

First, this repository does NOT use Generalized Advantage Estimation; it uses a Monte Carlo estimate to calculate rewards_to_go (the rewards variable in the code), and advantages = rewards_to_go - V(s_t).

The only time we will get an unfinished trajectory is at the end. So an accurate version would be:

# Monte Carlo estimate of returns
rewards = []

if self.buffer.is_terminals[-1]:
    discounted_reward = 0
else:
    discounted_reward = self.policy_old.critic(self.buffer.states[-1]).item()

for reward, is_terminal in zip(reversed(self.buffer.rewards), reversed(self.buffer.is_terminals)):
    if is_terminal:
        discounted_reward = 0
    discounted_reward = reward + (self.gamma * discounted_reward)
    rewards.insert(0, discounted_reward)

Also, the rewards-to-go calculation introduced in issue #8 seems to be wrong. I am a little busy now and will look into it later.

So the correct version, as other implementations use, might just be:

# Monte Carlo estimate of returns
rewards = []

if self.buffer.is_terminals[-1]:
    discounted_reward = 0
else:
    discounted_reward = self.policy_old.critic(self.buffer.states[-1]).item()

for reward in reversed(self.buffer.rewards):
    discounted_reward = reward + (self.gamma * discounted_reward)
    rewards.insert(0, discounted_reward)
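
To connect this back to advantages = rewards_to_go - V(s_t) mentioned above, here is a minimal sketch of how the returns built by this loop are typically turned into advantages for the PPO update. The normalization step and the state_values variable are assumptions about the surrounding code, not a quote from the repository.

import torch

# `rewards` is the list of discounted returns built by the loop above.
returns = torch.tensor(rewards, dtype=torch.float32)

# Optional but common: normalize the returns for numerical stability.
returns = (returns - returns.mean()) / (returns.std() + 1e-7)

# Assumed: state_values holds the critic's V(s_t) for every stored state, e.g.
#   state_values = self.policy_old.critic(old_states).squeeze(-1).detach()
advantages = returns - state_values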

@nikhilbarhate99 added the help wanted (Extra attention is needed) and question (Further information is requested) labels on Apr 20, 2021
gianlucadest commented Apr 20, 2021

Your adjusted implementation is fine. I use the same semantics for my A2C rollouts, where unfinished episodes are handled by calling critic(last_state); otherwise, the code works with finished episodes as before.

Boltzmachine commented Jan 15, 2022

This inaccuracy (maybe I should call it a bug) troubles me a lot! Thanks @nikhilbarhate99
Also, I guess there is another issue.

If the loop exits with t >= max_ep_len, a False will be appended to buffer.is_terminals ([False, False, ..., False]). When the next episode begins, the buffer keeps appending steps from that episode without being cleared, since update was not called ([False, False, ..., False] (first episode) + [False, False, ..., True] (next episode)). So when we calculate the accumulated rewards, they are accumulated across the two episodes, which is not what we expect.
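
One way to avoid mixing episodes like this is to record a separate truncation flag when the step limit is hit, so the return calculation can treat the cut-off point as a bootstrap boundary instead of a continuation. A minimal sketch of such a rollout loop follows; env, policy, max_ep_len, and the buffer fields (especially is_truncateds) are illustrative assumptions, not the repository's actual code.

# Sketch of a rollout loop that records truncation separately from termination.
state = env.reset()
for t in range(1, max_ep_len + 1):
    action = policy.select_action(state)
    next_state, reward, done, _ = env.step(action)

    # Cut off by the step limit rather than reaching a real terminal state.
    truncated = (t == max_ep_len) and not done

    buffer.rewards.append(reward)
    buffer.is_terminals.append(done)
    buffer.is_truncateds.append(truncated)  # hypothetical extra buffer field

    state = next_state
    if done or truncated:
        break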

11chens commented Oct 31, 2023

> @Boltzmachine: If the loop exits with t >= max_ep_len, a False is appended to buffer.is_terminals, and the accumulated rewards then bleed across two episodes.

I think we can follow the practice of the latest version of gym: add a truncated (bool) variable that indicates whether the episode is still incomplete when t = max_ep_len. Then, in the discounted_reward loop, add:

if truncated:
    discounted_reward = self.policy_old.critic(self.buffer.states[-1]).item()
elif is_terminal:
    discounted_reward = 0
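
Building on that suggestion, here is a minimal sketch of what the full return loop could look like if a per-step is_truncateds list were stored alongside is_terminals. The field names are assumptions, and it follows the same approximation used earlier in this thread: bootstrapping from the value of the last stored state of the truncated episode, since its next state is not in the buffer.

# Monte Carlo returns with bootstrapping at truncation points (sketch).
rewards = []
discounted_reward = 0

for reward, state, is_terminal, truncated in zip(
    reversed(self.buffer.rewards),
    reversed(self.buffer.states),
    reversed(self.buffer.is_terminals),   # True where the episode really ended
    reversed(self.buffer.is_truncateds),  # hypothetical flag: cut off by the step limit
):
    if is_terminal:
        discounted_reward = 0
    elif truncated:
        # Bootstrap from the critic's value of the state where the episode was cut off.
        discounted_reward = self.policy_old.critic(state).item()
    discounted_reward = reward + (self.gamma * discounted_reward)
    rewards.insert(0, discounted_reward)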
