The loss function of reward model. #22

Open
huzechuan opened this issue Jan 31, 2023 · 2 comments

Comments

@huzechuan

Hi, I'm confused about the loss function. ChatGPT's reward model scores two responses, takes the difference of the two scores, and passes it through a sigmoid. The loss function in this repo, however, takes only one response as input and uses its ranking score as the label for a cross-entropy (CE) loss. Is there an advantage to this?
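To make the two formulations in the question concrete, here is a minimal sketch in plain Python (no framework, standard library only). It is not taken from this repo or from any published implementation. The function names, and the assumption that the CE variant predicts one of several discrete rating bins for a single response, are illustrative only.

```python
import math
from typing import Sequence


def pairwise_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Sigmoid-of-difference loss: -log(sigmoid(r_preferred - r_rejected)).

    Rewritten as softplus(-(r_preferred - r_rejected)) so that large
    differences do not overflow math.exp.
    """
    diff = reward_preferred - reward_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def cross_entropy_loss(logits: Sequence[float], label: int) -> float:
    """Cross-entropy for ONE response.

    `logits` holds one score per rating bin (an assumed setup) and
    `label` is the index of the bin the rater chose.
    """
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# Pairwise: the model only has to order the two responses correctly.
loss_good_order = pairwise_loss(2.0, 0.0)  # preferred response scored higher
loss_bad_order = pairwise_loss(0.0, 2.0)   # preferred response scored lower

# Pointwise CE: the model has to hit the absolute rating of one response.
loss_ce = cross_entropy_loss([0.1, 2.0, -1.0], label=1)
```

In both cases the reward model itself still scores one response at a time. The pairwise loss only constrains the relative order of two scores, while the CE loss needs an absolute label for each response.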

@lucidrains
Owner

lucidrains commented Jan 31, 2023

@huzechuan i have to admit i haven't totally digested the way they derive their reward values for training

but at the moment, even if their reward is derived from a collection of sampled responses, this repository doesn't lock you into any one method, as you can do your second step (training the reward model) from any <sequence, reward value> pair, which you define

i guess i'll have to worry about this once i build out the application for sampling from some version of the model and collecting the ratings, so do let me know in detail the optimal way they discovered. i just think there are other applications beyond text that this could be used for (rl, protein design) that do not necessarily need this sigmoid of difference approach
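As a rough illustration of the point about `<sequence, reward value>` pairs, the sketch below turns one human ranking of sampled responses into either kind of training data. Both helper names are hypothetical, and the linear rank-to-reward mapping is only one arbitrary choice of "a reward value which you define", not something this repository prescribes.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def ranking_to_pairs(ranked: Sequence[str]) -> List[Tuple[str, str]]:
    """All (preferred, rejected) pairs from a best-first ranking.

    This is the data shape a sigmoid-of-difference loss consumes.
    """
    return list(combinations(ranked, 2))


def ranking_to_rewards(ranked: Sequence[str]) -> List[Tuple[str, float]]:
    """(sequence, reward value) pairs from the same best-first ranking.

    Assumption: rewards are spaced linearly from 1.0 (best) to 0.0 (worst).
    """
    n = len(ranked)
    if n == 1:
        return [(ranked[0], 1.0)]
    return [(seq, 1.0 - i / (n - 1)) for i, seq in enumerate(ranked)]


ranked_responses = ["a", "b", "c"]  # placeholder responses, best first
pairs = ranking_to_pairs(ranked_responses)
rewards = ranking_to_rewards(ranked_responses)
```

The second form also works when there is nothing to compare against, for example a single scalar score from a simulator, which is where a pairwise loss would not apply directly.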

@yangjianxin1

Hi, I'm confused about the loss function. ChatGPT's reward model scores two responses, takes the difference of the two scores, and passes it through a sigmoid. The loss function in this repo, however, takes only one response as input and uses its ranking score as the label for a cross-entropy (CE) loss. Is there an advantage to this?

I have the same confusion
