
not getting exactly the same embedding for different batchsize #76

Open
kirnap opened this issue Aug 10, 2023 · 5 comments

Comments

kirnap commented Aug 10, 2023

Hi,

I recently discovered that the model.encode method does not give exactly the same embedding for different batch_size values. The results are still close when I loosen atol (absolute tolerance), however. Is this expected behaviour, or is something buggy?

You can find a minimal code snippet below to replicate the conflicting embeddings:

from InstructorEmbedding import INSTRUCTOR
import numpy as np

model = INSTRUCTOR('hkunlp/instructor-xl')

query_instruction = 'Represent the Movie query for retrieving similar movies or tv shows: '
s1 = 'word'

# Four identical (instruction, sentence) pairs.
batch = [[query_instruction, s1],
         [query_instruction, s1],
         [query_instruction, s1],
         [query_instruction, s1]]

# Encode the same batch with three different batch sizes.
bbig2 = model.encode(batch, batch_size=2)
bbig4 = model.encode(batch, batch_size=4)
bbig1 = model.encode(batch, batch_size=1)

if not np.allclose(bbig4, bbig1, atol=1e-8):
    print('Different batchsize is not close for 1e-8 absolute tolerance')
if np.allclose(bbig4, bbig1, atol=1e-7):
    print('Different batchsize is close enough for 1e-7 absolute tolerance')

This prints out the following results:

Different batchsize is not close for 1e-8 absolute tolerance
Different batchsize is close enough for 1e-7 absolute tolerance
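
For reference, a quick way to quantify the gap (a minimal sketch, reusing the bbig4 and bbig1 arrays from the snippet above) is to look at the maximum absolute difference and the per-row cosine similarity:

import numpy as np

# Largest element-wise difference between the two runs.
max_abs_diff = np.max(np.abs(bbig4 - bbig1))
print(f'max absolute difference: {max_abs_diff:.2e}')

# Per-row cosine similarity; values of ~1.0 mean the embeddings are
# effectively identical despite the tiny numeric drift.
cos = np.sum(bbig4 * bbig1, axis=1) / (
    np.linalg.norm(bbig4, axis=1) * np.linalg.norm(bbig1, axis=1))
print('cosine similarities:', cos)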

Thanks in advance!

@aditya-y47

Any more findings on this yet?

kirnap commented Sep 25, 2023

Not from my end.

@dkirman-re

This is most likely something in the underlying HF transformers package. There's a lot of finger pointing, but still no resolution at this point, unfortunately.
Relevant GitHub issues:
UKPLab/sentence-transformers#2312
huggingface/transformers#2401
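
If it helps to narrow things down, the same order of drift can typically be reproduced with a plain sentence-transformers model, which suggests the cause sits in the shared transformers/PyTorch stack rather than in INSTRUCTOR itself. A minimal sketch (the model name here is just an example):

from sentence_transformers import SentenceTransformer
import numpy as np

# Encode identical sentences with two different batch sizes.
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ['word'] * 4

emb1 = model.encode(sentences, batch_size=1)
emb4 = model.encode(sentences, batch_size=4)

# The two runs usually agree only up to float32 precision (~1e-7).
print('max abs diff:', np.max(np.abs(emb1 - emb4)))
print('allclose at atol=1e-8:', np.allclose(emb1, emb4, atol=1e-8))
print('allclose at atol=1e-6:', np.allclose(emb1, emb4, atol=1e-6))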

@eyalyoli

I'm having the same issue. I tried manipulating other things, like the order or content of the batch, but the only factor that affects this is the batch size.

@ayalaall

Same here. I'm getting different embeddings for different batch_size values. The embeddings start to differ at around the 7th decimal place.
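
For what it's worth, that is consistent with float32 precision: the dtype carries only about 7 significant decimal digits, so disagreement around the 7th decimal place is at the limit of the representation rather than a genuinely different embedding. A quick NumPy check (a minimal sketch):

import numpy as np

# Machine epsilon and guaranteed decimal precision for float32.
print(np.finfo(np.float32).eps)        # ~1.19e-07
print(np.finfo(np.float32).precision)  # 6 decimal digits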
