Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a double addressing vector scorer #13370

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

ChrisHegarty
Copy link
Contributor

@ChrisHegarty ChrisHegarty commented May 15, 2024

This commit adds a method to RandomVectorScorerSupplier that allows to score two vectors based their ordinals.

The existing model of this API first creates a scorer, that effectively binds the ordinal of the first vector, to then score the ordinal of a second vector agains the first. This results in a RandomVectorScorer instance being created each time we want to score against a different vector in the first position. Allowing to score against two given ordinals avoids the creation of a RandomVectorScorer instance, which is likely expensive during graph building.

The new API could seen as a convenience for scorer(ord).score(node), or vice versa, but in fact there can be very different implementation characteristics of these - most notably the avoidance of the creation of a RandomVectorScorer instance.

This PR just updates a few places in the graph building to use the new API. Further analysis can determine where else this API would be beneficial.

@@ -291,6 +291,11 @@ public RandomVectorScorer scorer(int ord) throws IOException {
values2);
}

@Override
public float score(int firstOrd, int secondOrd) throws IOException {
return scorer(firstOrd).score(secondOrd);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can probably do better here, but this is good enough for now.

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing the Scorer object gave us is caching of the single vector that is used many times.

The underlying Offheap vector objects cache the vector on heap and prevents multiple reads.

Anything that calls the same random vectors twice with different ordinals, and those random vectors are read on heap will suffer significant performance issues as that on heap cache is thrashed on every comparison.

I didn't review fully, but wanted to ensure this performance was a consideration.

@ChrisHegarty
Copy link
Contributor Author

One thing the Scorer object gave us is caching of the single vector that is used many times.

The underlying Offheap vector objects cache the vector on heap and prevents multiple reads.

Anything that calls the same random vectors twice with different ordinals, and those random vectors are read on heap will suffer significant performance issues as that on heap cache is thrashed on every comparison.

I didn't review fully, but wanted to ensure this performance was a consideration.

Yes, this has been considered.

Since the cache is inside the VectorValues, then this behaviour remains the same - scoring of the same ordinal against the same VectorValues will read the cached value. However, there are cases where this may not be what you want. Creating a separate scorer and binding the initial ordinal can, and in some cases does, create a separate copy of the values. In fact, maybe we should be consistent here - the supplier should always carry separate copies for both the first and second ordinals.

@Override
public float score(int firstOrd, int secondOrd) throws IOException {
return similarity.score(
values.vectorValue(firstOrd),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, this instance should be reserved since it is used by the scorer(int ord) case?

@Override
public float score(int firstOrd, int secondOrd) throws IOException {
return similarityFunction.compare(
vectors1.vectorValue(firstOrd), vectors2.vectorValue(secondOrd));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vectors1 instance is already used by the scorer(int ord) case. If we want to allow this I believe we need a third copy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ha!! this is what I originally thought too, until I added a test case for this recently, see

public void testMultipleByteScorers() throws IOException {

Since neither scoreSuppliers or scorers are thread safe, then their operation has to be considered in a single-threaded model. For example: if one where to do this:

ss.scorer(ord1).score(ord2)
ss.scorer(ord3).score(ord4)

Then vectors1 is used to retrieve the value of, first ord1, then ord3. No issue (other than the internal cache-of-one will be invalidated when ord3 is retrieved). This is already the case. What this PR proposes is to support a model similar to:

ss.score(ord1, ord2)
ss.score(ord3, ord4)

, which simply avoids the creation of two scorer instances. Make sense, or maybe I've missed your point?

Additionally, I added a few small javadoc comments to help clarify the usage.

Copy link

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label May 30, 2024
@ChrisHegarty ChrisHegarty marked this pull request as draft May 30, 2024 07:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants