
Significant drop in recall for int8 scalar quantization using maximum_inner_product #13350

Open
naveentatikonda opened this issue May 8, 2024 · 16 comments


@naveentatikonda

Description

While running benchmarking tests with opensearch-benchmark on int8 scalar quantization against some standard datasets, I observed a significant drop in recall for the max inner product space type compared with the other space types.

Here are some of those results (HNSW beam width 100, max connections 16):

| S. No. | Dataset | Vector Dimension | Train Data Size | Query Data Size | Space Type | fp32 HNSW Recall@100 | SQ int8 Recall@100 |
|--------|---------|------------------|-----------------|-----------------|------------|----------------------|--------------------|
| 1 | cohere-768-l2 | 768 | 1M | 10K | L2 | 0.94 | 0.87 |
| 2 | cohere-768-IP | 768 | 1M | 10K | Inner Product | 0.94 | 0.36 |
| 3 | lastfm-64-dot | 65 | 292,385 | 50K | Inner Product | 0.95 | 0.16 |
| 4 | glove-200-angular | 200 | 1,183,514 | 10K | Cosine | 0.74 | 0.60 |
| 5 | glove-100-angular | 100 | 1,183,514 | 10K | Cosine | 0.80 | 0.59 |
| 6 | glove-50-angular | 50 | 1,183,514 | 10K | Cosine | 0.89 | 0.56 |

The cohere-768-IP dataset can be downloaded from this link. The L2 version of the Cohere dataset was generated from the same data by recomputing the ground truth. The rest of the datasets were downloaded from here.

Version and environment details

No response

@naveentatikonda (Author)

@benwtrent Are you aware of this recall issue with IP using SQ int8?

@benwtrent (Member)

@naveentatikonda interesting results for sure. I tested with max-inner-product & CohereV2 and didn't see a drop like this. I will try to replicate.

The cohere-768-IP dataset: this was CohereV2, correct?

@naveentatikonda (Author)

> The cohere-768-IP dataset: this was CohereV2, correct?

@benwtrent Thanks for your response. I'm not exactly sure which version it is, but this is the dataset:
https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings

Using the link I shared in the description, we can download a 1-million-vector subset of it in HDF5 format.

@benwtrent (Member)

@naveentatikonda Using lucene-util with scalar quantization, I get a recall@100 of 0.735.

I calculate recall by gathering the true 100 nearest neighbors for the test queries over the training docs under max-inner-product, then comparing the overlap with the 100 nearest neighbors found via scalar-quantized HNSW.
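A minimal sketch of that overlap computation (illustrative only, not luceneutil's actual code):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch of the recall@100 metric described above: the overlap
// between the brute-force top-100 neighbors (the ground truth) and the
// top-100 returned by the scalar-quantized HNSW search.
public class RecallDemo {

  static double recallAt100(Set<Integer> trueNeighbors, List<Integer> retrieved) {
    long hits = retrieved.stream().filter(trueNeighbors::contains).count();
    return hits / 100.0;
  }

  public static void main(String[] args) {
    // Toy data: 90 of the 100 retrieved ids overlap with the true top-100.
    Set<Integer> truth = new HashSet<>();
    List<Integer> retrieved = new ArrayList<>();
    for (int i = 0; i < 100; i++) truth.add(i);
    for (int i = 10; i < 110; i++) retrieved.add(i);
    System.out.println(recallAt100(truth, retrieved)); // prints 0.9
  }
}
```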

What is your flush buffer size? Maybe I need to kick off some more merges.

Do you force-merge to a single segment?

@naveentatikonda (Author)

> @naveentatikonda Using lucene-util with scalar quantization, I get a recall@100 of 0.735.

This looks much better. I will try to set it up and reproduce the results with luceneutil.

> What is your flush buffer size? Maybe I need to kick off some more merges.

Sorry, I'm not sure how to set the flush buffer size, and I don't think we control this parameter from OpenSearch.

> Do you force-merge to a single segment?

Yes, I was actually using a single-node cluster with 8 shards, force merged to a single segment.

@benwtrent (Member)

OK, I ran it again on my index, with the flush buffer set at 28MB and force merged. This time I ran it over all 10k queries (previously it was just 1k, as calculating the true nearest neighbors takes significant time).

Recall@100 is a steady 0.738.

@naveentatikonda (Author)

> OK, I ran it again on my index, with the flush buffer set at 28MB and force merged. This time I ran it over all 10k queries (previously it was just 1k, as calculating the true nearest neighbors takes significant time).
>
> Recall@100 is a steady 0.738.

@benwtrent Sorry for the delay in my response. Can you share more details about the dataset you used to get this recall? Is it a subset of the Cohere wikipedia-22-12-en-embeddings dataset, i.e., is the training data the first million vectors and the query data the next 10k vectors? If possible, can you please share your dataset and the ground truth you generated, through GitHub or Hugging Face?

Also, did you force merge to 1 segment before running the search queries?

@benwtrent (Member)

@naveentatikonda I used the dataset you linked. I simply downloaded the file.

Ground truth is just the brute-force nearest neighbors. I used the "test" set as the queries (10k of them) and "train" (1M) for the docs, computing the true NN.

Yes, I force merged. I imagine if I didn't, recall would actually be higher.

@naveentatikonda (Author)

@benwtrent I tried to set up luceneutil, but I ran into a ton of compilation errors when building it against the latest Lucene source code. I also have a question about the existing quantization process for quantizing a float value in a vector.

```java
// For 8 bits, this logic quantizes a float value into the uint8 range as floats [0.0, 255.0]
float dx = v - minQuantile;
float dxc = Math.max(minQuantile, Math.min(maxQuantile, v)) - minQuantile;
float dxs = scale * dxc;
```

Here, to bring it into the signed int8 range of [-128, 127], we cast it to byte. But this changes the sign and magnitude of values greater than 127 (up to 255): 128 is cast to -128, 129 to -127, and so on. Is this the right way to quantize? For space types like inner product, the sign matters when we compute the distance and score.

```java
dest[destIndex] = (byte) Math.round(dxs);
```

I have added a simple unit test to show that, because of this cast, even if we dequantize the quantized vector we don't get the original vector back.
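To make the wraparound concrete, here is a tiny standalone snippet (illustrative, not Lucene code) showing what the cast does to values above 127:

```java
// Demonstrates the wraparound described above: casting a float quantized
// into [0.0, 255.0] to Java's signed byte flips the sign for values above 127.
public class ByteCastDemo {
  public static void main(String[] args) {
    float[] quantized = {127.0f, 128.0f, 129.0f, 153.0f, 255.0f};
    for (float dxs : quantized) {
      byte b = (byte) Math.round(dxs);
      System.out.printf("%5.1f -> %4d%n", dxs, b); // 127, -128, -127, -103, -1
    }
  }
}
```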

Also, the corrective offset calculation matches the mathematical derivation in the documentation. But for max inner product, I have a simple example where the results get re-ranked because of this corrective offset:

```java
minQuantile * (v - minQuantile / 2.0F) + (dx - dxq) * dxq
```

Let:

```
V1  -> [-100.0, 20.0]
V2  -> [100.0, 20.0]
Query Vec -> [100.0, 20.0]
```

When we search against the float vectors, the results are ranked V2, V1, with scores computed as:

```
Q.V1 -> -9600  => 1 / (1 + 9600) = 0.00010415
Q.V2 -> 10400  => 10400 + 1      = 10401
```

When these vectors are quantized with minQuantile = -100.0 and maxQuantile = 100.0, the order of the results changes to V1, V2:

```
V1 -> [-100.0, 20.0] after quantization becomes [0, -103]  with corrective offset -2000
V2 -> [100.0, 20.0]  after quantization becomes [-1, 102]  with corrective offset -18000
QVec, same as V2,    becomes [-1, 102] with query offset -18000

Q.V1 -> q . v1 * alpha^2 + corrective offset + query offset
      = -10506 * 0.61514807 - 2000 - 18000 = -26462.746 => 1 / (1 + 26462.746) = 3.7787544E-5

Q.V2 -> 10406 * 0.61514807 - 18000 - 18000 = -29599.385 => 1 / (1 + 29599.385) = 3.3783344E-5
```
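For reference, here is a standalone sketch of my reading of the quoted quantize() logic (8 bits, minQuantile = -100, maxQuantile = 100); it reproduces the quantized bytes [0, -103] and the roughly -2000 corrective offset for V1 above:

```java
// Sketch of the 8-bit quantization plus per-component corrective offset,
// following the snippet quoted earlier. alpha is the dequantization
// multiplier (maxQuantile - minQuantile) / 255.
public class OffsetDemo {
  public static void main(String[] args) {
    float min = -100f, max = 100f;
    float scale = 255f / (max - min);
    float alpha = (max - min) / 255f;
    float[] v1 = {-100f, 20f};
    float offset = 0f;
    for (float v : v1) {
      float dx = v - min;
      float dxc = Math.max(min, Math.min(max, v)) - min;
      float dxs = scale * dxc;
      byte q = (byte) Math.round(dxs);      // 153 wraps around to -103 here
      float dxq = Math.round(dxs) * alpha;  // dequantized component
      offset += min * (v - min / 2.0f) + (dx - dxq) * dxq;
      System.out.println("quantized component: " + q);
    }
    System.out.println("corrective offset: " + offset); // approximately -2000
  }
}
```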

@benwtrent (Member)

Scalar quantization in Lucene is 7 bits by default, meaning the range of values is actually 0-127.

@naveentatikonda (Author)

> Scalar quantization in Lucene is 7 bits by default, meaning the range of values is actually 0-127.

For 7 bits, that makes sense. But the example I was referring to is for 8 bits.

@naveentatikonda (Author)

@benwtrent To be on the same page, can you please confirm whether you used 7 bits or 8 bits in the experiment you ran above with the Cohere dataset using inner product, where you got recall@100 of 0.738?

@mikemccand (Member)

Thank you @naveentatikonda for the deep dive here and the nice unit test ... I couldn't follow all of the logic you described, but if we are indeed first normalizing a dimension's value v to a float in [0 .. 255] and then simply doing (byte) Math.round(v) for the signed byte (int8) case, then that does indeed sound wrong. Should we instead normalize the float v to [-128 .. 127]?
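A quick sketch of that alternative, purely illustrative: shift the scaled value down by 128 so the byte cast never wraps. (The corrective-offset math would need to account for the shift; this only shows the range mapping.)

```java
// Maps values into [-128, 127] directly: scale into [0, 255] as before,
// then shift down by 128 before the byte cast, so the cast cannot wrap.
public class SignedRangeDemo {
  public static void main(String[] args) {
    float min = -100f, max = 100f;
    float scale = 255f / (max - min);
    for (float v : new float[] {-100f, 20f, 100f}) {
      float dxc = Math.max(min, Math.min(max, v)) - min;
      byte b = (byte) (Math.round(scale * dxc) - 128); // stays in [-128, 127]
      System.out.printf("%6.1f -> %4d%n", v, b); // prints -128, 25, 127
    }
  }
}
```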

@benwtrent (Member) commented May 21, 2024

I used int7 for my experiments. While losing one bit of precision isn't the best, it works well.

I explored adding an unsigned byte dot product, but that got rejected as too much code.

I think for int8 to support all vector similarities, we need an unsigned dot product.
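For illustration, a minimal sketch of what such an unsigned byte dot product could look like (hypothetical, not Lucene's implementation):

```java
// Hypothetical unsigned byte dot product: each byte is interpreted as
// 0..255 via Byte.toUnsignedInt instead of as a signed value.
public class UnsignedDot {
  static int dotUnsigned(byte[] a, byte[] b) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += Byte.toUnsignedInt(a[i]) * Byte.toUnsignedInt(b[i]);
    }
    return sum;
  }

  public static void main(String[] args) {
    byte[] a = {(byte) 153, (byte) 255}; // stored as -103, -1 when signed
    System.out.println(dotUnsigned(a, a)); // 153*153 + 255*255 = 88434
  }
}
```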

But, if we support signed int8, we should restrict it to [-127, 127], as there can be nice performance benefits if we can assume these ranges on various hardware.

I haven't done the math on Euclidean to figure out if we need an unsigned byte version of that as well.

I am out on vacation. But here is my old PR:

#12694

Maybe it's as simple as forcing int7 when the similarity used is max inner product and allowing signed int8 for everything else?

@jmazanec15 (Contributor)

I am trying to understand one thing: does the corrective offset for dot product rectify the issues with sign shift caused by going from the signed domain [-x, +y] to the unsigned domain [0, 127]? Or is this handled elsewhere?

For an overly simplified example, take a data set with min = -6 and max = 6 (and for simplicity, assume min and max are the lower and upper quantiles, respectively):

```
# Query vectors
query_vector = [-5]
quantized_query_vector = [11]

# Index vectors
## Full precision
1: [-6]
2: [-2]
3: [3]
4: [6]

## Quantized
1: [0]
2: [42]
3: [95]
4: [127]

## Full precision ordering of query_vector . index_vectors:
1, 2, 3, 4

## Quantized ordering of quantized_query_vector . quantized_index_vectors:
4, 3, 2, 1
```

I think I might be missing something, but how is this accounted for?
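For reference, here is a small sketch (assuming a 7-bit linear map, i.e. scale = 127 / (max - min)) reproducing the quantized values above:

```java
// Reproduces the quantized values in the example above with a 7-bit
// linear map over [min, max] = [-6, 6], i.e. scale = 127 / 12.
public class QuantizeDemo {
  public static void main(String[] args) {
    float min = -6f, max = 6f;
    float scale = 127f / (max - min);
    float[] values = {-5f, -6f, -2f, 3f, 6f};
    for (float v : values) {
      float dxc = Math.max(min, Math.min(max, v)) - min; // clamp, shift to 0
      System.out.printf("%5.1f -> %d%n", v, Math.round(scale * dxc));
      // prints: -5 -> 11, -6 -> 0, -2 -> 42, 3 -> 95, 6 -> 127
    }
  }
}
```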

@benwtrent (Member)

@jmazanec15 this is accounted for in the corrections. Moving from signed to unsigned is still just a linear transformation; we are not manually flipping signs, but rather doing a full linear rescale.

Assuming your vectors are in the order you provided, [[-6], [-2], [3], [6]]:

Their calculated dot-product corrections are [18.0, -5.8750076, -35.787956, -54.0]

For query vector [-5] its correction is 11.95908

The overall quantile multiplier is 0.008928018

To calculate the corrected dot-product score: `quantizedDotProduct * multiplier + queryCorrection + vectorCorrection`.

The raw dot-products (in order of the vectors)

[30.0, 10.0, -15.0, -30.0]

The quantized dot-products with corrections

[29.95908, 10.208817, -14.499098, -29.568478]

Of course, here are the quantized scores without corrections (thus not accounting for the linear shift and breaking max-inner-product score scaling):

[0.0, 462.0, 1045.0, 1397.0]
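A self-contained sketch of this computation (assuming 7-bit quantization with min/max quantiles of -6 and 6), which reproduces the corrections and corrected scores above:

```java
// Reconstruction of the corrected-score computation above: 7-bit scalar
// quantization with minQuantile = -6, maxQuantile = 6, query = [-5] and
// docs [-6], [-2], [3], [6]. Should print corrected scores close to
// [29.95908, 10.208817, -14.499098, -29.568478].
public class CorrectionDemo {
  static final float MIN = -6f, MAX = 6f;
  static final float SCALE = 127f / (MAX - MIN); // float -> quantized
  static final float ALPHA = (MAX - MIN) / 127f; // quantized -> float

  static int quantize(float v) {
    float dxc = Math.max(MIN, Math.min(MAX, v)) - MIN;
    return Math.round(SCALE * dxc);
  }

  static float correction(float v, int quantized) {
    float dx = v - MIN;
    float dxq = quantized * ALPHA;
    return MIN * (v - MIN / 2.0f) + (dx - dxq) * dxq;
  }

  public static void main(String[] args) {
    float query = -5f;
    int qq = quantize(query);                      // 11
    float queryCorrection = correction(query, qq); // ~11.95908
    float multiplier = ALPHA * ALPHA;              // ~0.008928018

    for (float v : new float[] {-6f, -2f, 3f, 6f}) {
      int dq = quantize(v);
      float corrected = (qq * dq) * multiplier + queryCorrection + correction(v, dq);
      System.out.printf("v=%5.1f raw=%6.1f corrected=%10.6f%n", v, query * v, corrected);
    }
  }
}
```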
