How is the training data divided? #87

Open
wsa-dhu opened this issue Sep 27, 2023 · 3 comments

Comments

@wsa-dhu

wsa-dhu commented Sep 27, 2023

Hello, I'm very interested in your work, and I'm currently trying to train a general sentence representation model. I have a question: when my training dataset comes from different domains, how can I ensure that samples within the same batch belong to the same task during training? Or would it be better to mix samples from different tasks within a batch? I'm not sure which assumption is right. Could you share some insights based on your experience?

@hongjin-su
Collaborator

Hi, thanks a lot for your interest in INSTRUCTOR!

You can order the examples so that, once they are split into consecutive batches, all examples in a batch come from the same task. Because we use in-batch negative sampling, each example's negatives are the other examples in its batch, so drawing them from the same task gives more meaningful negatives.
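The ordering described above could be sketched roughly as follows. This is a minimal illustration and not the INSTRUCTOR training code: the function name `make_task_homogeneous_batches`, the dict-based example format, and the `task_id` key are all assumptions made for this example.

```python
import random
from collections import defaultdict


def make_task_homogeneous_batches(examples, batch_size, seed=0, drop_last=True):
    """Group examples into batches that each contain a single task.

    `examples` is assumed to be a list of dicts with a "task_id" key
    (a hypothetical format chosen for this sketch).
    """
    rng = random.Random(seed)

    # Bucket the examples by task.
    by_task = defaultdict(list)
    for ex in examples:
        by_task[ex["task_id"]].append(ex)

    batches = []
    for task_examples in by_task.values():
        # Shuffle within a task so batch composition varies with the seed.
        rng.shuffle(task_examples)
        for i in range(0, len(task_examples), batch_size):
            batch = task_examples[i:i + batch_size]
            # Optionally skip the incomplete tail batch of each task.
            if len(batch) < batch_size and drop_last:
                continue
            batches.append(batch)

    # Shuffle the order of batches so tasks are interleaved over training.
    rng.shuffle(batches)
    return batches


if __name__ == "__main__":
    data = [{"task_id": t, "text": f"{t}-{i}"}
            for t, n in [("qa", 5), ("nli", 8)] for i in range(n)]
    for batch in make_task_homogeneous_batches(data, batch_size=4):
        print([ex["text"] for ex in batch])
```

With `drop_last=True` every batch is full, so flattening the batches back into one list and reading it sequentially (without shuffling) with the same batch size reproduces the same single-task batches. The trade-off is that up to `batch_size - 1` examples per task are left out for a given seed.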

@wsa-dhu
Author

wsa-dhu commented Oct 15, 2023 via email

@hongjin-su
Collaborator

hongjin-su commented Dec 19, 2023

We found that we may have left out a task_id when we uploaded the dataset. We expect to fix it in our next version.
