
DPO loss on different datasets #110

Open
wj210 opened this issue Feb 1, 2024 · 0 comments
wj210 commented Feb 1, 2024

Related to #38, though in my case this concerns full training rather than LoRA.

When I use a different set of preference pairs (i.e. chosen and rejected responses) over the same instructions (UltraFeedback), I get an extremely low eval/train loss that drops sharply at the start of training, in contrast to training on the original preferences as in ultrafeedback_binarized.

Eval loss (my preference dataset): [screenshot]
Eval loss (original preference dataset): [screenshot]
Train loss (mine): [screenshot]
Train loss (original): [screenshot]
Reward margin (mine): [screenshot]
Reward margin (original): [screenshot]
This large difference in scale seems to occur when the preference pairs are sampled from the reference policy itself, rather than from a variety of policies as in UltraFeedback.

Moreover, this sharp decrease in loss actually causes the DPO-trained model to perform worse across various benchmarks. Is there any intuition for why this happens?
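
For reference, here is a minimal sketch of the standard DPO objective I am referring to (illustrative only, variable names are mine and this is not necessarily this repo's exact implementation). The loss approaches zero as soon as the policy/reference log-ratio margin between chosen and rejected responses becomes large, which is what the reward-margin plots above track:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratios = policy_chosen_logps - ref_chosen_logps        # log pi(y_w|x) - log pi_ref(y_w|x)
    rejected_logratios = policy_rejected_logps - ref_rejected_logps  # log pi(y_l|x) - log pi_ref(y_l|x)
    logits = chosen_logratios - rejected_logratios                   # reward margin, up to the beta factor
    loss = -F.logsigmoid(beta * logits)                              # tends to 0 as the margin grows
    return loss.mean(), (beta * logits).detach().mean()

# Toy example: a large log-ratio margin already drives the loss close to zero.
pol_c, pol_r = torch.tensor([-10.0]), torch.tensor([-60.0])
ref_c, ref_r = torch.tensor([-30.0]), torch.tensor([-30.0])
loss, margin = dpo_loss(pol_c, pol_r, ref_c, ref_r)  # loss ~ 0.007, margin = 5.0
```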
