Hi, great work on this!
Just had a question about the MNTP. In the paper, you mention "when predicting a masked token at position i, we compute the loss based on the logits obtained from the token representation at the previous position i − 1, not the masked position itself."
I was a bit confused about this. Why is it done this way? Could you provide a more detailed explanation and the intuition behind it?
Thanks,
Brett
Thanks for your interest in our work. We did this to align our training objective with the pre-training setup of decoder-only LLMs. Decoder-only LMs are trained to predict the token at position i using the embedding of the token at position i − 1. By making sure our training objective follows the same pattern, the intuition is that we will maximally use the inherent capabilities of the model.
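For intuition, here is a minimal sketch of how such a shifted loss can be computed in PyTorch. This is not the repository's actual implementation; the function name, tensor names, and shapes are assumptions for illustration:

```python
import torch.nn.functional as F

def mntp_loss(logits, labels, ignore_index=-100):
    # logits: (batch, seq_len, vocab_size) — model outputs over the full sequence
    # labels: (batch, seq_len) — original token ids at masked positions, -100 elsewhere
    #
    # Shift so that the logits at position i-1 are scored against the label at
    # position i. This mirrors the causal-LM objective: the representation at
    # i-1 predicts token i.
    shifted_logits = logits[:, :-1, :]   # predictions from positions 0..n-2
    shifted_labels = labels[:, 1:]       # targets at positions 1..n-1
    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        shifted_labels.reshape(-1),
        ignore_index=ignore_index,       # only masked positions contribute to the loss
    )
```

Note that this is the same label shift used in standard causal-LM training; the only difference in MNTP is that the labels are set to the ignore index everywhere except at masked positions, so only masked tokens contribute to the loss.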