Should critic's input be prompt only? #57

Open
ginward opened this issue Nov 27, 2023 · 0 comments

ginward commented Nov 27, 2023

In the PPO implementation, the critic model appears to take both the prompt and the generated actions as input (or only the generated actions when `pooled` is true). However, if we treat the prompt as the state S_t and the prompt plus the generated actions as S_{t+T}, shouldn't the value function be V(S_t) rather than V(S_{t+T})?

In other words, when computing the advantage, shouldn't the value function estimate the expected (average) reward for a given prompt?
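To make the two readings concrete, here is a minimal Python sketch. It is not this repository's code, and `gae_advantages` and `prompt_baseline_advantages` are hypothetical helpers. The first computes token-level GAE from per-prefix critic values, where `values[0]` plays the role of V(S_t) for the prompt alone and later entries are values of longer prompt+response prefixes. The second estimates V(prompt) as the mean reward over several responses sampled for the same prompt, which assumes multiple samples per prompt are available.

```python
from statistics import mean
from typing import Dict, List


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 1.0, lam: float = 0.95) -> List[float]:
    """Hypothetical helper: token-level Generalized Advantage Estimation.

    values[t] is the critic's estimate V(s_t), where s_t is the prompt plus
    the first t generated tokens, so values[0] depends on the prompt only.
    The state after the last generated token is terminal (value 0).
    """
    T = len(rewards)
    assert len(values) == T, "one value per generated token"
    advantages = [0.0] * T
    gae = 0.0
    # Walk backwards, accumulating discounted TD errors.
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages


def prompt_baseline_advantages(
        rewards_by_prompt: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Hypothetical helper: advantage = reward - V(prompt), with V(prompt)
    estimated as the mean reward of the responses sampled for that prompt."""
    out: Dict[str, List[float]] = {}
    for prompt, rewards in rewards_by_prompt.items():
        baseline = mean(rewards)  # Monte Carlo estimate of V(prompt)
        out[prompt] = [r - baseline for r in rewards]
    return out


# Example: one response of 3 tokens with a single terminal reward of 1.0.
print(gae_advantages([0.0, 0.0, 1.0], [0.5, 0.6, 0.8], gamma=1.0, lam=1.0))
# Example: three sampled responses for one prompt.
print(prompt_baseline_advantages({"p1": [1.0, 0.0, 0.5]}))
```

With gamma = lam = 1 and only a terminal reward R, the TD errors telescope so that the token-level advantage at position t is R - V(s_t). At t = 0 that is exactly R - V(S_t), the prompt-only baseline asked about above, while later positions subtract the values of longer prefixes.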
