Can we add configuration on dropping raw vectors from quantized formats after some period of time? #13251

benwtrent · 2024-03-29T19:04:06Z

Description

Tangentially related to: #13158

I have observed that once the corpus reaches a fairly large size, the quantiles barely change during segment merges. This is tricky to measure fully, and it is hard to make any promise about lossiness (users can always start throwing in garbage that shakes up the whole world). But if the data isn't a "bad actor", the quantiles and quantization buckets become fairly stable over time.

Maybe we should add a configuration option, a new codec, or some other way to drop the raw floating-point vectors. The 4x reduction in disk usage would be really nice for many use cases.
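
To make the disk arithmetic concrete, here is a rough back-of-the-envelope sketch. It assumes float32 raw vectors (4 bytes per dimension) and one byte per dimension for the quantized copy, and it ignores per-vector correction terms, the HNSW graph, and other metadata. The function name and the corpus size are made up for illustration.

```python
FLOAT32_BYTES = 4  # raw vector component
INT8_BYTES = 1     # assumed size of one quantized component


def vector_bytes(num_vectors, dims, keep_raw):
    """Approximate bytes used by vector data alone (no graph, no metadata)."""
    quantized = num_vectors * dims * INT8_BYTES
    raw = num_vectors * dims * FLOAT32_BYTES if keep_raw else 0
    return raw + quantized


if __name__ == "__main__":
    # Hypothetical corpus: 1M vectors with 768 dimensions.
    both = vector_bytes(1_000_000, 768, keep_raw=True)
    quantized_only = vector_bytes(1_000_000, 768, keep_raw=False)
    print(f"raw + quantized: {both / 1e9:.2f} GB")
    print(f"quantized only:  {quantized_only / 1e9:.2f} GB")
```

Under these assumptions the quantized copy is 4x smaller than the float32 copy on its own. Compared with a segment that stores both copies, the saving works out to 5x.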

I am not 100% sure how this would look (a threshold provided by the user, or we just do it based on internal statistics).
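
One way the "threshold" variant could look, purely as a sketch: compute the quantile bounds before and after a merge and measure how far they moved relative to the old interval. This is not Lucene's implementation. `quantile_bounds`, `quantile_drift`, and the 1% threshold are all hypothetical names and values chosen for illustration.

```python
import random


def quantile_bounds(values, confidence=0.99):
    """Return (lower, upper) cutoffs that keep the central `confidence` mass."""
    ordered = sorted(values)
    tail = (1.0 - confidence) / 2.0
    lo_idx = round(tail * (len(ordered) - 1))
    hi_idx = round((1.0 - tail) * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]


def quantile_drift(old_bounds, new_bounds):
    """Largest shift of either bound, as a fraction of the old interval width."""
    width = old_bounds[1] - old_bounds[0]
    shift = max(abs(new_bounds[0] - old_bounds[0]),
                abs(new_bounds[1] - old_bounds[1]))
    if width == 0:
        return 0.0 if shift == 0 else float("inf")
    return shift / width


if __name__ == "__main__":
    rng = random.Random(42)
    # A large existing segment and a small new one drawn from the same distribution.
    big_segment = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
    small_segment = [rng.gauss(0.0, 1.0) for _ in range(1_000)]

    before = quantile_bounds(big_segment)
    after = quantile_bounds(big_segment + small_segment)
    drift = quantile_drift(before, after)

    threshold = 0.01  # hypothetical user-provided threshold
    print(f"drift = {drift:.5f}, stable = {drift < threshold}")
```

A real implementation would need to decide what to do when the drift exceeds the threshold after the raw vectors are already gone, which is exactly the "bad actor" case.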

mikemccand · 2024-04-01T16:09:47Z

This is a neat idea -- it would allow the user to accept some "lossy compression" when they know/expect that loss will be minor for their use case. Sort of like JPEG vs RAW image encoding.

One question (I don't know enough about how the HNSW merging works): if we did this, and segments where only the quantized vectors remain are merged, we would have to use those quantized vectors to build the next HNSW graph, right? Today, do we always go back to the full-precision vectors to build the graph for the newly merged segment, or are we always using the quantized vectors to build the graph?

I suppose that if using the quantized vectors at search time is not hurting much, because the "quantization noise" in the resulting distance computation between two vectors is negligible, then building the graph from the quantized forms should not hurt much either, since graph building is really just a series of searches to find the top K vectors that should be linked up in the graph.
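
A toy way to probe that intuition, as a sketch only: quantize random vectors with a simple min/max scalar quantizer, then compare the brute-force top K by dot product on the original floats against the top K on the dequantized values. This uses exhaustive search rather than HNSW and a simplified quantizer rather than Lucene's, and every function name here is made up for illustration.

```python
import random


def quantize(vec, lo, hi, levels=255):
    """Map each component to an integer code in [0, levels], clamping to [lo, hi]."""
    step = (hi - lo) / levels
    return [round((min(max(x, lo), hi) - lo) / step) for x in vec]


def dequantize(codes, lo, hi, levels=255):
    """Map integer codes back to approximate float values."""
    step = (hi - lo) / levels
    return [lo + c * step for c in codes]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(query, vectors, k):
    """Indices of the k vectors with the highest dot product against the query."""
    scores = [(dot(query, v), i) for i, v in enumerate(vectors)]
    return [i for _, i in sorted(scores, reverse=True)[:k]]


if __name__ == "__main__":
    rng = random.Random(0)
    vectors = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(200)]
    query = [rng.gauss(0.0, 1.0) for _ in range(16)]

    # Use the global min/max as the quantization range (a simplification).
    lo = min(min(v) for v in vectors)
    hi = max(max(v) for v in vectors)
    approx = [dequantize(quantize(v, lo, hi), lo, hi) for v in vectors]

    k = 10
    exact_ids = set(top_k(query, vectors, k))
    approx_ids = set(top_k(query, approx, k))
    print(f"top-{k} overlap: {len(exact_ids & approx_ids)}/{k}")
```

If the overlap stays high across many queries, the neighbor lists chosen during graph construction would mostly be the same either way. Random Gaussian data is a friendly case, though, so this says little about real embeddings.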

benwtrent added type:enhancement vector-based-search labels Mar 29, 2024
