Inconsistency between the instruction template suggested vs that in the training data #96

Open
debraj135 opened this issue Nov 9, 2023 · 5 comments

@debraj135

I noticed that the instructions in the training data end with a semicolon (;) and have no trailing whitespace.

For example 'Represent the Science sentence;' instead of 'Represent the Science sentence: '

In the README, however, the proposed format is sometimes 'Represent the Science sentence: ' (with a trailing space) and in other places 'Represent the Science sentence:' (without one).

All three variants seem to produce different embeddings and therefore different similarity scores. Can you please let us know which instruction template is the right one?
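
A minimal sketch of how the difference between the three variants could be measured. It assumes an encoder that takes a list of `[instruction, text]` pairs and returns one vector per pair, which is the calling convention the project README shows for `INSTRUCTOR.encode`. The names `compare_templates`, `cosine`, and `toy_encode` are hypothetical helpers introduced here for illustration; `toy_encode` is only a stand-in so the script runs without downloading a model, and its numbers say nothing about the real model.

```python
import math
from itertools import combinations

# The three instruction variants discussed in this issue.
TEMPLATES = [
    "Represent the Science sentence;",   # as seen in the training data
    "Represent the Science sentence: ",  # README variant with trailing space
    "Represent the Science sentence:",   # README variant without trailing space
]


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def compare_templates(encode, sentence, templates=TEMPLATES):
    """Embed the same sentence under each template and return the
    cosine similarity for every pair of templates."""
    vectors = encode([[template, sentence] for template in templates])
    similarities = {}
    for i, j in combinations(range(len(templates)), 2):
        similarities[(templates[i], templates[j])] = cosine(vectors[i], vectors[j])
    return similarities


def toy_encode(pairs):
    """Hypothetical stand-in encoder: buckets characters into 8 counters.
    Replace with a real encoder to get meaningful numbers."""
    vectors = []
    for instruction, text in pairs:
        vec = [0.0] * 8
        for position, ch in enumerate(instruction + text):
            vec[(ord(ch) + position) % 8] += 1.0
        vectors.append(vec)
    return vectors


if __name__ == "__main__":
    sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
    for (a, b), sim in compare_templates(toy_encode, sentence).items():
        print(repr(a), "vs", repr(b), "->", round(sim, 4))
```

To run this against the real model, one would presumably pass `INSTRUCTOR('hkunlp/instructor-large').encode` in place of `toy_encode`. A pairwise similarity noticeably below 1.0 would confirm that the punctuation variants are not interchangeable for a given sentence.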

@debraj135
Author

Wondering if I'm missing a detail. Has anyone else come across this?

@hongjin-su
Collaborator

Thanks a lot for your interest in INSTRUCTOR!

Like other LLMs, INSTRUCTOR is sensitive to the wording of its instructions, and its small size may make this worse. I would say all of your proposed instructions follow the basic template, but more trials or heuristics may be needed to figure out the best instruction.

@debraj135
Author

Thank you. I have a few follow-up questions:

  1. Did the instruction provided as input to the model during training end with a semicolon (;) or a colon (:)?
  2. Which of the three templates I mentioned above was used for the evaluation on the MTEB leaderboard?

@debraj135
Author

Following up on this.

@hongjin-su
Collaborator

Sorry for the late reply!

In our training and evaluation, we may not have been very strict about punctuation. We would be glad to make it more consistent in future versions!
