Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Improve skipping logic for after values in sort query #13313

Open
gashutos opened this issue Apr 17, 2024 · 6 comments
Open

[RFC] Improve skipping logic for after values in sort query #13313

gashutos opened this issue Apr 17, 2024 · 6 comments

Comments

@gashutos
Copy link
Contributor

Description

Background

Lucene sort queries are using skipping logic for faster execution and skip non-competitive documents by updating its competitive iterator whenever it updates its bottom value in priority queue. Reference
This works for fields indexed with BKD.
In case of after value, after value is considerd as topValue we try to execute skipping logic with both topValue and afterValue. The skipping logic has a constraint, and with that, the estimated number of competitive documents must reduce to 1/8 to intersect BKD skipping logic. That can be problamatic in like described below.

Problem

With #12333 change, we started using topValue to skip documents in case topValue is able to skip 7/8 number of documents. But what if we only able to skip 6/8 number of documents with topvalue and those all non-competitive documents 6/8 are ahead of after document in docsIdIterator ?
This problem specifically comes in timeseries workload.
i.e. we have time series segment where documents ids and document values are in nearly sorted order and lets assume we have 100 documents and its time field values are from 1,2,3,....100.
Now if I trigger sort query sort on this field with after values as 87, we wont able to skip first 86 documents and they will end up being in comparison here in PagingFieldCollector.
Assume this happening for millions of documents in case of time series workload.

Proposed solution

Lets invoke skipping logic in case topValue is known but bottomvalue is unknown ir-respective of number of docuements we are able to skip. That will be one time invocation of skipping logic in case after value is specific and we will skip all documents which are non-competitive w.r.t after value. And from next iteration, we will know about bottomValue once priority queue is full and can have constraint of 7/8 documents skipping.....

@gashutos
Copy link
Contributor Author

@jpountz @gsmiller what you guys think on this ?
let me know if you need sample code change for this....

@jpountz
Copy link
Contributor

jpountz commented Apr 17, 2024

Lets invoke skipping logic in case topValue is known but bottomvalue is unknown ir-respective of number of docuements we are able to skip. That will be one time invocation of skipping logic

Is my understanding correct that this would possibly help with deep pagination but hurt with the first few pages since it may force Lucene to evaluate a BKD range query that matches most docs?

@jpountz
Copy link
Contributor

jpountz commented Apr 17, 2024

FWIW there is a similar problem for sorting by score, as document scores may be clustered in the doc ID space, which increases the time for the minimum competitite score to converge to the actual score of the top k-th hit. There are a few approaches that have been documented to evaluate a lower bound of the top-k-th score, see e.g. https://culpepper.io/publications/pm+19-sigir.pdf. I wonder if there's anything to learn from this for sorting by field.

@gashutos
Copy link
Contributor Author

gashutos commented Apr 18, 2024

Is my understanding correct that this would possibly help with deep pagination but hurt with the first few pages since it may force Lucene to evaluate a BKD range query that matches most docs?

@jpountz yes, I am thinking to relax the 88% (current) non-competitive document threshold would be great solution instead blindly invoking skipping. Like for topValue in case bottomValue is not known, we can go with threshold 20%, i.e. if we are able to skip 1/5 non-competitive document, we can invoke skipping.

This will be really one time BKD intersect to skip value based on topValue since topValue remains constant throughout same query execution.

This will relly help not eve deep pagination, but mid-start, mid level pagination as well. basically any pagination which are not too early in the iterator. The performance gain I see in http_logs was around 20x with this logic when I tested this with OpenSearch for asc sort query with after value where after value was around 70% from start of docsIdIterator.
500 ms to 25 ms for that query with early skipping invocation.

@gashutos
Copy link
Contributor Author

gashutos commented Apr 18, 2024

I was thinking if we can make these threshold 88%, 20% etc configurable by Lucene config as well to give more conrtol to end user to fine tune based on workload.

@jpountz
Copy link
Contributor

jpountz commented Apr 18, 2024

I'd rather not make this configurable, but if you can suggest a new threshold that works better for this adversarial case and doesn't hurt sorted queries such as tasks from nightly benchmarks, I'd probably be good with it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants