-
Notifications
You must be signed in to change notification settings - Fork 958
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Get better cost estimate on MultiTermQuery over few terms #13201
base: main
Are you sure you want to change the base?
Conversation
The existing cost estimate for multi-term queries was taking the worst-case scenario (e.g the query matches every term in the field), to provide an upper bound on the cost. In cases where the MultiTermQuery matches a small number of terms, we can provide a tighter estimate by summing the doc frequencies over those terms.
private long estimateCost(Terms terms, long queryTermsCount) throws IOException { | ||
|
||
// Try to explicitly compute the cost, but only if the term count is sufficiently small. | ||
if (queryTermsCount < MAX_TERMS_TO_COUNT) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If queryTermsCount
is unknown (i.e. -1), we'll also enter here, I guess this is intended, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct -- I thought about making that more explicit, but opted to take advantage of the "fall through" logic.
@@ -292,7 +292,21 @@ public long cost() { | |||
}; | |||
} | |||
|
|||
private static long estimateCost(Terms terms, long queryTermsCount) throws IOException { | |||
private static final int MAX_TERMS_TO_COUNT = 128; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stupid question: would it be feasible / desirable to make this value somehow configurable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a great question! Maybe we could pass it as a constructor parameter, and the constant RewriteMethod
s in MultiTermQuery
could specify the default?
I host a weekly OpenSearch Lucene Study Group on Zoom and we spent this week's meetup talking about this PR (and your issue): https://forum.opensearch.org/t/opensearch-lucene-study-group-meeting-monday-march-25th-2024/18547/3
There were some nice ideas that came up there around how to pick something better than an arbitrary limit -- like maybe using a time threshold instead. (But would that be a threshold per-clause?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was indeed an interesting discussion. Thanks for sharing @msfroh !
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! |
Description
The existing cost estimate for multi-term queries was taking the worst-case scenario (e.g the query matches every term in the field), to provide an upper bound on the cost.
In cases where the MultiTermQuery matches a small number of terms, we can provide a tighter estimate by summing the doc frequencies over those terms.
Fixes #13029