Add a double addressing vector scorer #13370

ChrisHegarty · 2024-05-15T10:10:02Z

This commit adds a method to RandomVectorScorerSupplier that scores two vectors based on their ordinals.

The existing model of this API first creates a scorer, which effectively binds the ordinal of the first vector, and then scores the ordinal of a second vector against the first. This results in a RandomVectorScorer instance being created each time we want to score against a different vector in the first position. Scoring two given ordinals directly avoids the creation of a RandomVectorScorer instance, which is likely expensive during graph building.

The new API could be seen as a convenience for scorer(ord).score(node), or vice versa, but in fact the two can have very different implementation characteristics, most notably the avoidance of creating a RandomVectorScorer instance.
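
To make the difference between the two call shapes concrete, here is a minimal sketch. The types `SimpleScorer`, `SimpleScorerSupplier` and `DotProductSupplier` are hypothetical, simplified stand-ins and not the Lucene classes; the real methods also declare `IOException`, which is omitted here. The counter exists only to show how many scorer instances each path allocates.

```java
// Hypothetical stand-in for a scorer bound to one "first" ordinal.
interface SimpleScorer {
  float score(int node);
}

// Hypothetical stand-in for the supplier, with the two-ordinal method added.
interface SimpleScorerSupplier {
  SimpleScorer scorer(int ord);

  // The "convenience" reading: bind the first ordinal, then score the second.
  default float score(int firstOrd, int secondOrd) {
    return scorer(firstOrd).score(secondOrd);
  }
}

final class DotProductSupplier implements SimpleScorerSupplier {
  private final float[][] vectors;
  int scorersCreated = 0; // illustration only: counts allocated scorers

  DotProductSupplier(float[][] vectors) {
    this.vectors = vectors;
  }

  private static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  @Override
  public SimpleScorer scorer(int ord) {
    scorersCreated++;
    float[] bound = vectors[ord]; // the first vector is bound here
    return node -> dot(bound, vectors[node]);
  }

  // Direct two-ordinal scoring: no scorer instance is created.
  @Override
  public float score(int firstOrd, int secondOrd) {
    return dot(vectors[firstOrd], vectors[secondOrd]);
  }
}
```

In this sketch both paths return the same value for the same pair of ordinals; only the overridden `score(int, int)` does so without allocating a scorer.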

This PR just updates a few places in the graph building to use the new API. Further analysis can determine where else this API would be beneficial.

ChrisHegarty · 2024-05-15T10:24:43Z

lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorScorer.java

@@ -291,6 +291,11 @@ public RandomVectorScorer scorer(int ord) throws IOException {
          values2);
    }

+    @Override
+    public float score(int firstOrd, int secondOrd) throws IOException {
+      return scorer(firstOrd).score(secondOrd);


we can probably do better here, but this is good enough for now.

benwtrent

One thing the Scorer object gave us is caching of the single vector that is used many times.

The underlying Offheap vector objects cache the vector on heap and prevent multiple reads.

Anything that calls the same random vectors twice with different ordinals, where those random vectors are read on heap, will suffer significant performance issues, as that on-heap cache is thrashed on every comparison.

I didn't review fully, but wanted to ensure this performance was a consideration.

ChrisHegarty · 2024-05-15T13:00:25Z

Yes, this has been considered.

Since the cache is inside the VectorValues, this behaviour remains the same: scoring the same ordinal against the same VectorValues will read the cached value. However, there are cases where this may not be what you want. Creating a separate scorer and binding the initial ordinal can, and in some cases does, create a separate copy of the values. In fact, maybe we should be consistent here: the supplier should always carry separate copies for both the first and second ordinals.
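
The caching point can be sketched as follows. `CachingVectorValues` and `TwoCopySupplier` are hypothetical names, and the single-entry cache is an assumption modelled on the "cache-of-one" behaviour described in this thread, not a copy of the Lucene off-heap classes. The `reads` counter stands in for reads from the underlying storage.

```java
// Hypothetical vector values with a single-entry cache keyed by ordinal.
final class CachingVectorValues {
  private final float[][] storage; // stands in for off-heap storage
  private int cachedOrd = -1;
  private float[] cached;
  int reads = 0; // illustration only: counts cache misses

  CachingVectorValues(float[][] storage) {
    this.storage = storage;
  }

  float[] vectorValue(int ord) {
    if (ord != cachedOrd) { // miss: read from storage and replace the cache
      reads++;
      cached = storage[ord].clone();
      cachedOrd = ord;
    }
    return cached;
  }

  // A copy shares the storage but has its own independent cache.
  CachingVectorValues copy() {
    return new CachingVectorValues(storage);
  }
}

// Hypothetical supplier carrying one copy per operand position.
final class TwoCopySupplier {
  final CachingVectorValues first;
  final CachingVectorValues second;

  TwoCopySupplier(float[][] storage) {
    this.first = new CachingVectorValues(storage);
    this.second = first.copy();
  }

  float score(int firstOrd, int secondOrd) {
    float[] a = first.vectorValue(firstOrd);
    float[] b = second.vectorValue(secondOrd);
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }
}
```

With separate copies, repeatedly scoring the same first ordinal against different second ordinals hits the first copy's cache after the initial read. With a single shared instance, each lookup of the second ordinal would evict the first, which is the thrashing concern raised above.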

jimczi · 2024-05-15T17:29:30Z

lucene/core/src/java/org/apache/lucene/codecs/hnsw/ScalarQuantizedVectorScorer.java

+    @Override
+    public float score(int firstOrd, int secondOrd) throws IOException {
+      return similarity.score(
+          values.vectorValue(firstOrd),


Same here, this instance should be reserved since it is used by the scorer(int ord) case?

jimczi · 2024-05-15T17:31:32Z

lucene/core/src/java/org/apache/lucene/codecs/hnsw/DefaultFlatVectorScorer.java

+    @Override
+    public float score(int firstOrd, int secondOrd) throws IOException {
+      return similarityFunction.compare(
+          vectors1.vectorValue(firstOrd), vectors2.vectorValue(secondOrd));


vectors1 instance is already used by the scorer(int ord) case. If we want to allow this I believe we need a third copy.

ChrisHegarty · 2024-05-15

Ha! This is what I originally thought too, until I recently added a test case for this, see

lucene/lucene/core/src/test/org/apache/lucene/codecs/hnsw/TestFlatVectorScorer.java

Line 80 in b1d3c08

public void testMultipleByteScorers() throws IOException {

Since neither scorer suppliers nor scorers are thread safe, their operation has to be considered in a single-threaded model. For example, if one were to do this:

ss.scorer(ord1).score(ord2)
ss.scorer(ord3).score(ord4)

Then vectors1 is used to retrieve the value of, first ord1, then ord3. No issue (other than that the internal cache-of-one will be invalidated when ord3 is retrieved). This is already the case. What this PR proposes is to support a model similar to:

ss.score(ord1, ord2)
ss.score(ord3, ord4)

This simply avoids the creation of two scorer instances. Make sense, or maybe I've missed your point?
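
A small sketch of the sequential, single-threaded use described above. `ReusingValues` and `SequentialSupplier` are hypothetical names, and the reused buffer is an assumption about how such values might hand out vectors; it is not taken from the Lucene sources.

```java
// Hypothetical values that return the same buffer on every call,
// overwriting it with the requested vector.
final class ReusingValues {
  private final float[][] storage;
  private final float[] buffer;
  int lastOrd = -1; // the ordinal currently held in the buffer

  ReusingValues(float[][] storage) {
    this.storage = storage;
    this.buffer = new float[storage[0].length];
  }

  float[] vectorValue(int ord) {
    System.arraycopy(storage[ord], 0, buffer, 0, buffer.length);
    lastOrd = ord;
    return buffer;
  }
}

// Hypothetical supplier with one values instance per operand position.
final class SequentialSupplier {
  final ReusingValues vectors1;
  final ReusingValues vectors2;

  SequentialSupplier(float[][] storage) {
    this.vectors1 = new ReusingValues(storage);
    this.vectors2 = new ReusingValues(storage);
  }

  // Both operands are read and compared within one call, so a later call
  // overwriting the buffers cannot affect a result that was already returned.
  float score(int firstOrd, int secondOrd) {
    float[] a = vectors1.vectorValue(firstOrd);
    float[] b = vectors2.vectorValue(secondOrd);
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }
}
```

Calling `score(ord1, ord2)` and then `score(ord3, ord4)` from one thread leaves `vectors1` holding `ord3`, mirroring the sequence in the comment; under these assumptions, each result is complete before the next call touches the buffers.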

Additionally, I added a few small javadoc comments to help clarify the usage.

github-actions · 2024-05-30T00:18:48Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

ChrisHegarty added 2 commits May 13, 2024 09:51

Minor javadoc clarifications

0ab8af5

Add a double addressing vector scorer

a22b58a

ChrisHegarty added the vector-based-search label May 15, 2024

ChrisHegarty requested review from mayya-sharipova and jimczi May 15, 2024 10:10

minimal test

ff59699

typo

ab432dd

benwtrent reviewed May 15, 2024

use a copy in SQ vector scorer

7deb3b5

jimczi reviewed May 15, 2024

Merge branch 'javadoc_minor' into double_addr_scoring

316e3ac

jimczi mentioned this pull request May 22, 2024

Refactor libvec to replace custom scorer types with Lucene types elastic/elasticsearch#108917

Merged

github-actions bot added the Stale label May 30, 2024

ChrisHegarty marked this pull request as draft May 30, 2024 07:41
