
not getting exactly the same embedding for different batchsize #76

Open
kirnap opened this issue Aug 10, 2023 · 5 comments

Comments

kirnap commented Aug 10, 2023

Hi,

I recently discovered that the model.encode method does not give exactly the same embedding for different batch_size values. The results are still close when I loosen atol (absolute tolerance), however. Is this expected behaviour, or is something buggy?

You can find a minimal code snippet below to replicate the conflicting embeddings:

from InstructorEmbedding import INSTRUCTOR
import numpy as np

model = INSTRUCTOR('hkunlp/instructor-xl')

query_instruction = 'Represent the Movie query for retrieving similar movies or tv shows: '
s1 = 'word'

# Four identical (instruction, sentence) pairs.
batch = [[query_instruction, s1],
         [query_instruction, s1],
         [query_instruction, s1],
         [query_instruction, s1]]

# Encode the same batch with three different batch sizes.
bbig2 = model.encode(batch, batch_size=2)
bbig4 = model.encode(batch, batch_size=4)
bbig1 = model.encode(batch, batch_size=1)

if not np.allclose(bbig4, bbig1, atol=1e-8):
    print('Different batchsize is not close for 1e-8 absolute tolerance')
if np.allclose(bbig4, bbig1, atol=1e-7):
    print('Different batchsize is close enough for 1e-7 absolute tolerance')

This prints out the following results:

Different batchsize is not close for 1e-8 absolute tolerance
Different batchsize is close enough for 1e-7 absolute tolerance
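
For reference, a quick way to quantify the gap (a minimal sketch, reusing the bbig4 and bbig1 arrays from the snippet above) is to look at the maximum absolute difference and the per-row cosine similarity:

import numpy as np

# Largest element-wise difference between the two runs.
max_abs_diff = np.max(np.abs(bbig4 - bbig1))
print(f'max absolute difference: {max_abs_diff:.2e}')

# Per-row cosine similarity; values of ~1.0 mean the embeddings are
# effectively identical despite the tiny numeric drift.
cos = np.sum(bbig4 * bbig1, axis=1) / (
    np.linalg.norm(bbig4, axis=1) * np.linalg.norm(bbig1, axis=1))
print('cosine similarities:', cos)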

Thanks in advance!

@aditya-y47

Any more findings on this yet?

kirnap commented Sep 25, 2023

Not from my end.

@dkirman-re

This is most likely something in the underlying HF transformers package. There's a lot of finger pointing, but still no resolution at this point, unfortunately.
Relevant GitHub issues:
UKPLab/sentence-transformers#2312
huggingface/transformers#2401
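
If it helps to narrow things down, the same order of drift can typically be reproduced with a plain sentence-transformers model, which suggests the cause sits in the shared transformers/PyTorch stack rather than in INSTRUCTOR itself. A minimal sketch (the model name here is just an example):

from sentence_transformers import SentenceTransformer
import numpy as np

# Encode identical sentences with two different batch sizes.
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ['word'] * 4

emb1 = model.encode(sentences, batch_size=1)
emb4 = model.encode(sentences, batch_size=4)

# The two runs usually agree only up to float32 precision (~1e-7).
print('max abs diff:', np.max(np.abs(emb1 - emb4)))
print('allclose at atol=1e-8:', np.allclose(emb1, emb4, atol=1e-8))
print('allclose at atol=1e-6:', np.allclose(emb1, emb4, atol=1e-6))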

@eyalyoli

I'm having the same issue. I tried manipulating other things, like the order or content of the batch, but the only factor that affects this is the batch size.

@ayalaall

Same here. I'm getting different embeddings for different batch_size values. The embeddings start to differ at around the 7th decimal place.
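
For what it's worth, that is consistent with float32 precision: the dtype carries only about 7 significant decimal digits, so disagreement around the 7th decimal place is at the limit of the representation rather than a genuinely different embedding. A quick NumPy check (a minimal sketch):

import numpy as np

# Machine epsilon and guaranteed decimal precision for float32.
print(np.finfo(np.float32).eps)        # ~1.19e-07
print(np.finfo(np.float32).precision)  # 6 decimal digits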
