Fine-tuning for sentence comparison #69

Open
Mukish45 opened this issue Jul 27, 2023 · 3 comments

Comments

@Mukish45

Thanks a lot for open-sourcing your excellent work. I would like to further fine-tune your model to compare two sentences and get their similarity score. You built the MEDI dataset in a general format for retrieval, pairwise classification, clustering, etc., with {query, pos, neg, task_name}. For my use case, I want to encode two sentences and compute their cosine similarity. What format should my training set have in that case? I think neg sentences might not be needed for this (if they are needed, why?).

Please assist me with this. Thank you

@hongjin-su
Collaborator

Hi, thanks a lot for your interest in INSTRUCTOR!

You may follow https://github.com/HKUNLP/instructor-embedding#data to prepare the training set. The negative sentences are necessary for training, because the model needs to learn not only to minimize the distance between positive pairs, but also to maximize the distance between negative pairs.
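
To make the format concrete, here is a minimal sketch of turning a list of similar sentence pairs into {query, pos, neg, task_name} records. It assumes each of `query`, `pos`, and `neg` is an `[instruction, text]` pair, as described in the linked README section; the instruction string, the task name `my_sts`, and the helper `build_examples` are hypothetical and only for illustration. Negatives are sampled at random from other pairs, which is a simple stand-in and not necessarily how MEDI negatives were chosen, so a sampled negative may occasionally be semantically close to the query.

```python
import json
import random


def build_examples(pairs, instruction, task_name, seed=0):
    """Convert (sentence_a, sentence_b) similar pairs into training records.

    The negative for each record is the second sentence of a different,
    randomly chosen pair (an assumption made for this sketch).
    """
    if len(pairs) < 2:
        raise ValueError("need at least two pairs to sample a negative")
    rng = random.Random(seed)
    examples = []
    for i, (sent_a, sent_b) in enumerate(pairs):
        # Candidate negatives: the second sentence of every other pair.
        candidates = [p[1] for j, p in enumerate(pairs) if j != i]
        negative = rng.choice(candidates)
        examples.append({
            "query": [instruction, sent_a],
            "pos": [instruction, sent_b],
            "neg": [instruction, negative],
            "task_name": task_name,
        })
    return examples


# Hypothetical toy data; replace with your own similar-sentence pairs.
pairs = [
    ("A man is playing a guitar.", "Someone plays the guitar."),
    ("The stock market fell sharply today.", "Shares dropped steeply this afternoon."),
    ("She is cooking pasta for dinner.", "A woman prepares a pasta meal."),
]

examples = build_examples(
    pairs,
    instruction="Represent the sentence for semantic similarity:",  # hypothetical
    task_name="my_sts",  # hypothetical
)
print(json.dumps(examples[0], indent=2))
```

With only positive pairs, a model could satisfy the objective by mapping every sentence to the same vector; the `neg` entry is what gives it something to push away from.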

@Mukish45
Author

Mukish45 commented Aug 3, 2023

@hongjin-su Thank you for the clear explanation. I have one more question: can the model be run on multiple threads? It currently takes about 1 second to encode 2 sentences, and I want to increase the encoding speed. If there is a way, please let me know.

@hongjin-su
Collaborator

An easy way to achieve the same effect is to split the data. If you split the data into separate pieces, you can encode each piece independently, with no need to handle communication between threads.
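
As a rough sketch of that idea, the snippet below splits a list of sentences into near-equal chunks, encodes each chunk independently, and concatenates the results in the original order. The helpers `split_into_chunks` and `encode_in_chunks` are hypothetical, and `fake_encode` is only a stand-in so the example runs without the model; in practice you would pass a function that wraps your model's encoding call, and each chunk could equally be sent to a separate process or machine.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split items into at most n_chunks contiguous, near-equal pieces."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks get one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def encode_in_chunks(sentences, encode_fn, n_workers=4):
    """Encode each chunk independently and return results in input order."""
    chunks = split_into_chunks(sentences, n_workers)
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        # pool.map yields results in the same order as the input chunks.
        results = list(pool.map(encode_fn, chunks))
    return [vec for chunk_result in results for vec in chunk_result]


def fake_encode(batch):
    # Hypothetical placeholder: one 1-d "embedding" per sentence.
    return [[float(len(s))] for s in batch]


sentences = [
    "A man is playing a guitar.",
    "Someone plays the guitar.",
    "The stock market fell sharply today.",
    "Shares dropped steeply this afternoon.",
    "She is cooking pasta for dinner.",
]
embeddings = encode_in_chunks(sentences, fake_encode, n_workers=2)
print(len(embeddings))
```

Because the chunks share no state, no coordination between workers is needed beyond collecting the outputs at the end.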
