
DPO loss on different datasets #110

Open
wj210 opened this issue Feb 1, 2024 · 0 comments
wj210 commented Feb 1, 2024

Related to #38, though in my case this concerns full training rather than LoRA.

When I use a different set of preference pairs (i.e. chosen and rejected responses) over the same instructions (UltraFeedback), I get an extremely low eval/train loss that drops sharply at the start of training, in contrast to training on the original preferences as in ultrafeedback_binarized.

Eval loss (my preference dataset): [screenshot]
Eval loss (original preference dataset): [screenshot]
Train loss (mine): [screenshot]
Train loss (original): [screenshot]
Reward margin (mine): [screenshot]
Reward margin (original): [screenshot]
This large difference in scale seems to occur when the preference pairs are sampled from the reference policy itself, rather than from a variety of policies as in UltraFeedback.

Moreover, this sharp decrease in loss actually causes the DPO-trained model to perform worse across various benchmarks. Is there any intuition for why this happens?
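
For reference, here is a minimal sketch of the standard DPO objective I am referring to (illustrative only, variable names are mine and this is not necessarily this repo's exact implementation). The loss approaches zero as soon as the policy/reference log-ratio margin between chosen and rejected responses becomes large, which is what the reward-margin plots above track:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratios = policy_chosen_logps - ref_chosen_logps        # log pi(y_w|x) - log pi_ref(y_w|x)
    rejected_logratios = policy_rejected_logps - ref_rejected_logps  # log pi(y_l|x) - log pi_ref(y_l|x)
    logits = chosen_logratios - rejected_logratios                   # reward margin, up to the beta factor
    loss = -F.logsigmoid(beta * logits)                              # tends to 0 as the margin grows
    return loss.mean(), (beta * logits).detach().mean()

# Toy example: a large log-ratio margin already drives the loss close to zero.
pol_c, pol_r = torch.tensor([-10.0]), torch.tensor([-60.0])
ref_c, ref_r = torch.tensor([-30.0]), torch.tensor([-30.0])
loss, margin = dpo_loss(pol_c, pol_r, ref_c, ref_r)  # loss ~ 0.007, margin = 5.0
```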
