
Significant drop in recall for int8 scalar quantization using maximum_inner_product #13350

Open
naveentatikonda opened this issue May 8, 2024 · 16 comments


@naveentatikonda

Description

While running benchmarking tests with opensearch-benchmark on int8 scalar quantization against some standard datasets, I observed a significant drop in recall for the max inner product space type compared with the other space types.

Here are some of those results (HNSW beam width 100, max connections 16):

| S. No. | Dataset | Vector Dimension | Train Data Size | Query Data Size | Space Type | fp32 HNSW Recall@100 | SQ int8 Recall@100 |
|--------|---------|------------------|-----------------|-----------------|------------|----------------------|--------------------|
| 1 | cohere-768-l2 | 768 | 1M | 10K | L2 | 0.94 | 0.87 |
| 2 | cohere-768-IP | 768 | 1M | 10K | Inner Product | 0.94 | 0.36 |
| 3 | lastfm-64-dot | 65 | 292,385 | 50K | Inner Product | 0.95 | 0.16 |
| 4 | glove-200-angular | 200 | 1,183,514 | 10K | Cosine | 0.74 | 0.60 |
| 5 | glove-100-angular | 100 | 1,183,514 | 10K | Cosine | 0.80 | 0.59 |
| 6 | glove-50-angular | 50 | 1,183,514 | 10K | Cosine | 0.89 | 0.56 |

The cohere-768-IP dataset can be downloaded from this link. The L2 version of the Cohere dataset was generated from the same data by recomputing the ground truth. The rest of the datasets were downloaded from here.

Version and environment details

No response

@naveentatikonda (Author)

@benwtrent Are you aware of this recall issue with IP using SQ int8?

@benwtrent (Member)

@naveentatikonda interesting results for sure. I tested with max-inner-product & CohereV2 and didn't see a drop like this. I will try to replicate.

The cohere-768-IP dataset: this was CohereV2, correct?

@naveentatikonda (Author)

> The cohere-768-IP dataset: this was CohereV2, correct?

@benwtrent Thanks for your response. I'm not exactly sure which version it is, but this is the dataset:
https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings

Using the link I shared in the description, we can download a 1-million-vector subset of it in HDF5 format.

@benwtrent (Member)

@naveentatikonda Using lucene-util with scalar quantization, I get a recall@100 of 0.735.

I calculate recall by gathering the true 100 nearest neighbors for the test queries over the training docs under max-inner-product, then comparing the overlap with the 100 nearest neighbors found via scalar-quantized HNSW.
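A minimal sketch of that overlap computation (illustrative only, not luceneutil's actual code):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch of the recall@100 metric described above: the overlap
// between the brute-force top-100 neighbors (the ground truth) and the
// top-100 returned by the scalar-quantized HNSW search.
public class RecallDemo {

  static double recallAt100(Set<Integer> trueNeighbors, List<Integer> retrieved) {
    long hits = retrieved.stream().filter(trueNeighbors::contains).count();
    return hits / 100.0;
  }

  public static void main(String[] args) {
    // Toy data: 90 of the 100 retrieved ids overlap with the true top-100.
    Set<Integer> truth = new HashSet<>();
    List<Integer> retrieved = new ArrayList<>();
    for (int i = 0; i < 100; i++) truth.add(i);
    for (int i = 10; i < 110; i++) retrieved.add(i);
    System.out.println(recallAt100(truth, retrieved)); // prints 0.9
  }
}
```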

What is your flush buffer size? Maybe I need to kick off some more merges.

Do you force-merge to a single segment?

@naveentatikonda (Author)

> @naveentatikonda Using lucene-util with scalar quantization, I get a recall@100 of 0.735.

This looks much better. I will try to set it up and reproduce the results with luceneutil.

> What is your flush buffer size? Maybe I need to kick off some more merges.

Sorry, I'm not sure how to set the flush buffer size, and I don't think we control this parameter from OpenSearch.

> Do you force-merge to a single segment?

Yes, I was actually using a single-node cluster with 8 shards, force merged to a single segment.

@benwtrent (Member)

OK, I ran it again on my index, with the flush buffer set at 28MB and force merged. This time I ran it over all 10k queries (previously it was just 1k, as calculating the true nearest neighbors takes significant time).

Recall@100 is a steady 0.738.

@naveentatikonda (Author)

> OK, I ran it again on my index, with the flush buffer set at 28MB and force merged. This time I ran it over all 10k queries (previously it was just 1k, as calculating the true nearest neighbors takes significant time).
>
> Recall@100 is a steady 0.738.

@benwtrent Sorry for the delay in my response. Can you share more details about the dataset you used to get this recall? Is it a subset of the Cohere wikipedia-22-12-en-embeddings dataset, i.e., is the training data the first million vectors and the query data the next 10k vectors? If possible, can you please share your dataset and the ground truth you generated, through GitHub or Hugging Face?

Also, did you force merge to 1 segment before running the search queries?

@benwtrent (Member)

@naveentatikonda I used the dataset you linked. I simply downloaded the file.

Ground truth is just the brute-force nearest neighbors. I used the "test" set as the queries (10k of them) and "train" (1M) for the docs, computing the true NN.

Yes, I force merged. I imagine if I didn't, recall would actually be higher.

@naveentatikonda (Author)

@benwtrent I tried to set up luceneutil, but I ran into a ton of compilation errors when building it against the latest Lucene source code. I also have a question about the existing quantization process for quantizing a float value in a vector.

```java
// For 8 bits, this logic quantizes a float value into the uint8 range as floats [0.0, 255.0]
float dx = v - minQuantile;
float dxc = Math.max(minQuantile, Math.min(maxQuantile, v)) - minQuantile;
float dxs = scale * dxc;
```

Here, to bring it into the signed int8 range of [-128, 127], we cast it to byte. But this changes the sign and magnitude of values greater than 127 (up to 255): 128 is cast to -128, 129 to -127, and so on. Is this the right way to quantize? For space types like inner product, the sign matters when we compute the distance and score.

```java
dest[destIndex] = (byte) Math.round(dxs);
```

I have added a simple unit test to show that, because of this cast, even if we dequantize the quantized vector we don't get the original vector back.
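To make the wraparound concrete, here is a tiny standalone snippet (illustrative, not Lucene code) showing what the cast does to values above 127:

```java
// Demonstrates the wraparound described above: casting a float quantized
// into [0.0, 255.0] to Java's signed byte flips the sign for values above 127.
public class ByteCastDemo {
  public static void main(String[] args) {
    float[] quantized = {127.0f, 128.0f, 129.0f, 153.0f, 255.0f};
    for (float dxs : quantized) {
      byte b = (byte) Math.round(dxs);
      System.out.printf("%5.1f -> %4d%n", dxs, b); // 127, -128, -127, -103, -1
    }
  }
}
```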

Also, the corrective offset calculation matches the mathematical derivation in the documentation. But for max inner product, I have a simple example where the results get re-ranked because of this corrective offset:

```java
minQuantile * (v - minQuantile / 2.0F) + (dx - dxq) * dxq
```

Let:

```
V1  -> [-100.0, 20.0]
V2  -> [100.0, 20.0]
Query Vec -> [100.0, 20.0]
```

When we search against the float vectors, the results are ranked V2, V1, with scores computed as:

```
Q.V1 -> -9600  => 1 / (1 + 9600) = 0.00010415
Q.V2 -> 10400  => 10400 + 1      = 10401
```

When these vectors are quantized with minQuantile = -100.0 and maxQuantile = 100.0, the order of the results changes to V1, V2:

```
V1 -> [-100.0, 20.0] after quantization becomes [0, -103]  with corrective offset -2000
V2 -> [100.0, 20.0]  after quantization becomes [-1, 102]  with corrective offset -18000
QVec, same as V2,    becomes [-1, 102] with query offset -18000

Q.V1 -> q . v1 * alpha^2 + corrective offset + query offset
      = -10506 * 0.61514807 - 2000 - 18000 = -26462.746 => 1 / (1 + 26462.746) = 3.7787544E-5

Q.V2 -> 10406 * 0.61514807 - 18000 - 18000 = -29599.385 => 1 / (1 + 29599.385) = 3.3783344E-5
```
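For reference, here is a standalone sketch of my reading of the quoted quantize() logic (8 bits, minQuantile = -100, maxQuantile = 100); it reproduces the quantized bytes [0, -103] and the roughly -2000 corrective offset for V1 above:

```java
// Sketch of the 8-bit quantization plus per-component corrective offset,
// following the snippet quoted earlier. alpha is the dequantization
// multiplier (maxQuantile - minQuantile) / 255.
public class OffsetDemo {
  public static void main(String[] args) {
    float min = -100f, max = 100f;
    float scale = 255f / (max - min);
    float alpha = (max - min) / 255f;
    float[] v1 = {-100f, 20f};
    float offset = 0f;
    for (float v : v1) {
      float dx = v - min;
      float dxc = Math.max(min, Math.min(max, v)) - min;
      float dxs = scale * dxc;
      byte q = (byte) Math.round(dxs);      // 153 wraps around to -103 here
      float dxq = Math.round(dxs) * alpha;  // dequantized component
      offset += min * (v - min / 2.0f) + (dx - dxq) * dxq;
      System.out.println("quantized component: " + q);
    }
    System.out.println("corrective offset: " + offset); // approximately -2000
  }
}
```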

@benwtrent (Member)

Scalar quantization in Lucene is 7 bits by default, meaning the range of values is actually 0-127.

@naveentatikonda (Author)

> Scalar quantization in Lucene is 7 bits by default, meaning the range of values is actually 0-127.

For 7 bits, that makes sense. But the example I was referring to is for 8 bits.

@naveentatikonda (Author)

@benwtrent To be on the same page, can you please confirm whether you used 7 bits or 8 bits in the experiment you ran above with the Cohere dataset using inner product, where you got recall@100 of 0.738?

@mikemccand (Member)

Thank you @naveentatikonda for the deep dive here and the nice unit test ... I couldn't follow all of the logic you described, but if we are indeed first normalizing a dimension's value v to a float in [0 .. 255] and then simply doing (byte) Math.round(v) for the signed byte (int8) case, then that does indeed sound wrong. Should we instead normalize the float v to [-128 .. 127]?
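A quick sketch of that alternative, purely illustrative: shift the scaled value down by 128 so the byte cast never wraps. (The corrective-offset math would need to account for the shift; this only shows the range mapping.)

```java
// Maps values into [-128, 127] directly: scale into [0, 255] as before,
// then shift down by 128 before the byte cast, so the cast cannot wrap.
public class SignedRangeDemo {
  public static void main(String[] args) {
    float min = -100f, max = 100f;
    float scale = 255f / (max - min);
    for (float v : new float[] {-100f, 20f, 100f}) {
      float dxc = Math.max(min, Math.min(max, v)) - min;
      byte b = (byte) (Math.round(scale * dxc) - 128); // stays in [-128, 127]
      System.out.printf("%6.1f -> %4d%n", v, b); // prints -128, 25, 127
    }
  }
}
```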

@benwtrent (Member) commented May 21, 2024

I used int7 for my experiments. While losing one bit of precision isn't the best, it works well.

I explored adding an unsigned byte dot product, but that got rejected as too much code.

I think for int8 to support all vector similarities, we need an unsigned dot product.
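For illustration, a minimal sketch of what such an unsigned byte dot product could look like (hypothetical, not Lucene's implementation):

```java
// Hypothetical unsigned byte dot product: each byte is interpreted as
// 0..255 via Byte.toUnsignedInt instead of as a signed value.
public class UnsignedDot {
  static int dotUnsigned(byte[] a, byte[] b) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += Byte.toUnsignedInt(a[i]) * Byte.toUnsignedInt(b[i]);
    }
    return sum;
  }

  public static void main(String[] args) {
    byte[] a = {(byte) 153, (byte) 255}; // stored as -103, -1 when signed
    System.out.println(dotUnsigned(a, a)); // 153*153 + 255*255 = 88434
  }
}
```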

But, if we support signed int8, we should restrict it to [-127, 127], as there can be nice performance benefits if we can assume these ranges on various hardware.

I haven't done the math on Euclidean to figure out if we need an unsigned byte version of that as well.

I am out on vacation. But here is my old PR:

#12694

Maybe it's as simple as forcing int7 when the similarity used is max inner product and allowing signed int8 for everything else?

@jmazanec15 (Contributor)

I am trying to understand one thing: does the corrective offset for dot product rectify the issues with sign shift caused by going from the signed domain [-x, +y] to the unsigned domain [0, 127]? Or is this handled elsewhere?

For an overly simplified example, take a data set with min = -6 and max = 6 (and for simplicity, assume min and max are the lower and upper quantiles, respectively):

```
# Query vectors
query_vector = [-5]
quantized_query_vector = [11]

# Index vectors
## Full precision
1: [-6]
2: [-2]
3: [3]
4: [6]

## Quantized
1: [0]
2: [42]
3: [95]
4: [127]

## Full precision ordering of query_vector . index_vectors:
1, 2, 3, 4

## Quantized ordering of quantized_query_vector . quantized_index_vectors:
4, 3, 2, 1
```

I think I might be missing something, but how is this accounted for?
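For reference, here is a small sketch (assuming a 7-bit linear map, i.e. scale = 127 / (max - min)) reproducing the quantized values above:

```java
// Reproduces the quantized values in the example above with a 7-bit
// linear map over [min, max] = [-6, 6], i.e. scale = 127 / 12.
public class QuantizeDemo {
  public static void main(String[] args) {
    float min = -6f, max = 6f;
    float scale = 127f / (max - min);
    float[] values = {-5f, -6f, -2f, 3f, 6f};
    for (float v : values) {
      float dxc = Math.max(min, Math.min(max, v)) - min; // clamp, shift to 0
      System.out.printf("%5.1f -> %d%n", v, Math.round(scale * dxc));
      // prints: -5 -> 11, -6 -> 0, -2 -> 42, 3 -> 95, 6 -> 127
    }
  }
}
```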

@benwtrent (Member)

@jmazanec15 this is accounted for in the corrections. Moving from signed to unsigned is still just a linear transformation; we are not manually flipping signs, but rather doing a full linear rescale.

Assuming your vectors are in the order you provided, [[-6], [-2], [3], [6]]:

Their calculated dot-product corrections are [18.0, -5.8750076, -35.787956, -54.0]

For query vector [-5] its correction is 11.95908

The overall quantile multiplier is 0.008928018

To calculate the corrected dot-product score: `quantizedDotProduct * multiplier + queryCorrection + vectorCorrection`.

The raw dot-products (in order of the vectors)

[30.0, 10.0, -15.0, -30.0]

The quantized dot-products with corrections

[29.95908, 10.208817, -14.499098, -29.568478]

Of course, here are the quantized scores without corrections (thus not accounting for the linear shift and breaking max-inner-product score scaling):

[0.0, 462.0, 1045.0, 1397.0]
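A self-contained sketch of this computation (assuming 7-bit quantization with min/max quantiles of -6 and 6), which reproduces the corrections and corrected scores above:

```java
// Reconstruction of the corrected-score computation above: 7-bit scalar
// quantization with minQuantile = -6, maxQuantile = 6, query = [-5] and
// docs [-6], [-2], [3], [6]. Should print corrected scores close to
// [29.95908, 10.208817, -14.499098, -29.568478].
public class CorrectionDemo {
  static final float MIN = -6f, MAX = 6f;
  static final float SCALE = 127f / (MAX - MIN); // float -> quantized
  static final float ALPHA = (MAX - MIN) / 127f; // quantized -> float

  static int quantize(float v) {
    float dxc = Math.max(MIN, Math.min(MAX, v)) - MIN;
    return Math.round(SCALE * dxc);
  }

  static float correction(float v, int quantized) {
    float dx = v - MIN;
    float dxq = quantized * ALPHA;
    return MIN * (v - MIN / 2.0f) + (dx - dxq) * dxq;
  }

  public static void main(String[] args) {
    float query = -5f;
    int qq = quantize(query);                      // 11
    float queryCorrection = correction(query, qq); // ~11.95908
    float multiplier = ALPHA * ALPHA;              // ~0.008928018

    for (float v : new float[] {-6f, -2f, 3f, 6f}) {
      int dq = quantize(v);
      float corrected = (qq * dq) * multiplier + queryCorrection + correction(v, dq);
      System.out.printf("v=%5.1f raw=%6.1f corrected=%10.6f%n", v, query * v, corrected);
    }
  }
}
```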
