-
Notifications
You must be signed in to change notification settings - Fork 958
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a double addressing vector scorer #13370
base: main
Are you sure you want to change the base?
Conversation
@@ -291,6 +291,11 @@ public RandomVectorScorer scorer(int ord) throws IOException { | |||
values2); | |||
} | |||
|
|||
@Override | |||
public float score(int firstOrd, int secondOrd) throws IOException { | |||
return scorer(firstOrd).score(secondOrd); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can probably do better here, but this is good enough for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing the Scorer object gave us is caching of the single vector that is used many times.
The underlying Offheap vector objects cache the vector on heap and prevents multiple reads.
Anything that calls the same random vectors twice with different ordinals, and those random vectors are read on heap will suffer significant performance issues as that on heap cache is thrashed on every comparison.
I didn't review fully, but wanted to ensure this performance was a consideration.
Yes, this has been considered. Since the cache is inside the VectorValues, then this behaviour remains the same - scoring of the same ordinal against the same VectorValues will read the cached value. However, there are cases where this may not be what you want. Creating a separate scorer and binding the initial ordinal can, and in some cases does, create a separate copy of the values. In fact, maybe we should be consistent here - the supplier should always carry separate copies for both the first and second ordinals. |
@Override | ||
public float score(int firstOrd, int secondOrd) throws IOException { | ||
return similarity.score( | ||
values.vectorValue(firstOrd), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, this instance should be reserved since it is used by the scorer(int ord)
case?
@Override | ||
public float score(int firstOrd, int secondOrd) throws IOException { | ||
return similarityFunction.compare( | ||
vectors1.vectorValue(firstOrd), vectors2.vectorValue(secondOrd)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vectors1 instance is already used by the scorer(int ord)
case. If we want to allow this I believe we need a third copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ha!! this is what I originally thought too, until I added a test case for this recently, see
lucene/lucene/core/src/test/org/apache/lucene/codecs/hnsw/TestFlatVectorScorer.java
Line 80 in b1d3c08
public void testMultipleByteScorers() throws IOException { |
Since neither scoreSuppliers or scorers are thread safe, then their operation has to be considered in a single-threaded model. For example: if one where to do this:
ss.scorer(ord1).score(ord2)
ss.scorer(ord3).score(ord4)
Then vectors1 is used to retrieve the value of, first ord1, then ord3. No issue (other than the internal cache-of-one will be invalidated when ord3 is retrieved). This is already the case. What this PR proposes is to support a model similar to:
ss.score(ord1, ord2)
ss.score(ord3, ord4)
, which simply avoids the creation of two scorer instances. Make sense, or maybe I've missed your point?
Additionally, I added a few small javadoc comments to help clarify the usage.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! |
This commit adds a method to RandomVectorScorerSupplier that allows to score two vectors based their ordinals.
The existing model of this API first creates a scorer, that effectively binds the ordinal of the first vector, to then score the ordinal of a second vector agains the first. This results in a RandomVectorScorer instance being created each time we want to score against a different vector in the first position. Allowing to score against two given ordinals avoids the creation of a RandomVectorScorer instance, which is likely expensive during graph building.
The new API could seen as a convenience for
scorer(ord).score(node)
, or vice versa, but in fact there can be very different implementation characteristics of these - most notably the avoidance of the creation of a RandomVectorScorer instance.This PR just updates a few places in the graph building to use the new API. Further analysis can determine where else this API would be beneficial.