[RFC] Improve skipping logic for after values in sort query #13313

gashutos · 2024-04-17T07:40:32Z

Description

Background

Lucene sort queries are using skipping logic for faster execution and skip non-competitive documents by updating its competitive iterator whenever it updates its bottom value in priority queue. Reference
This works for fields indexed with BKD.
In case of after value, after value is considerd as topValue we try to execute skipping logic with both topValue and afterValue. The skipping logic has a constraint, and with that, the estimated number of competitive documents must reduce to 1/8 to intersect BKD skipping logic. That can be problamatic in like described below.

Problem

With #12333 change, we started using topValue to skip documents in case topValue is able to skip 7/8 number of documents. But what if we only able to skip 6/8 number of documents with topvalue and those all non-competitive documents 6/8 are ahead of after document in docsIdIterator ?
This problem specifically comes in timeseries workload.
i.e. we have time series segment where documents ids and document values are in nearly sorted order and lets assume we have 100 documents and its time field values are from 1,2,3,....100.
Now if I trigger sort query sort on this field with after values as 87, we wont able to skip first 86 documents and they will end up being in comparison here in PagingFieldCollector.
Assume this happening for millions of documents in case of time series workload.

Proposed solution

Lets invoke skipping logic in case topValue is known but bottomvalue is unknown ir-respective of number of docuements we are able to skip. That will be one time invocation of skipping logic in case after value is specific and we will skip all documents which are non-competitive w.r.t after value. And from next iteration, we will know about bottomValue once priority queue is full and can have constraint of 7/8 documents skipping.....

The text was updated successfully, but these errors were encountered:

gashutos · 2024-04-17T07:41:25Z

@jpountz @gsmiller what you guys think on this ?
let me know if you need sample code change for this....

jpountz · 2024-04-17T20:26:51Z

Lets invoke skipping logic in case topValue is known but bottomvalue is unknown ir-respective of number of docuements we are able to skip. That will be one time invocation of skipping logic

Is my understanding correct that this would possibly help with deep pagination but hurt with the first few pages since it may force Lucene to evaluate a BKD range query that matches most docs?

jpountz · 2024-04-17T20:39:12Z

FWIW there is a similar problem for sorting by score, as document scores may be clustered in the doc ID space, which increases the time for the minimum competitite score to converge to the actual score of the top k-th hit. There are a few approaches that have been documented to evaluate a lower bound of the top-k-th score, see e.g. https://culpepper.io/publications/pm+19-sigir.pdf. I wonder if there's anything to learn from this for sorting by field.

gashutos · 2024-04-18T17:11:55Z

Is my understanding correct that this would possibly help with deep pagination but hurt with the first few pages since it may force Lucene to evaluate a BKD range query that matches most docs?

@jpountz yes, I am thinking to relax the 88% (current) non-competitive document threshold would be great solution instead blindly invoking skipping. Like for topValue in case bottomValue is not known, we can go with threshold 20%, i.e. if we are able to skip 1/5 non-competitive document, we can invoke skipping.

This will be really one time BKD intersect to skip value based on topValue since topValue remains constant throughout same query execution.

This will relly help not eve deep pagination, but mid-start, mid level pagination as well. basically any pagination which are not too early in the iterator. The performance gain I see in http_logs was around 20x with this logic when I tested this with OpenSearch for asc sort query with after value where after value was around 70% from start of docsIdIterator.
500 ms to 25 ms for that query with early skipping invocation.

gashutos · 2024-04-18T17:12:57Z

I was thinking if we can make these threshold 88%, 20% etc configurable by Lucene config as well to give more conrtol to end user to fine tune based on workload.

jpountz · 2024-04-18T20:35:52Z

I'd rather not make this configurable, but if you can suggest a new threshold that works better for this adversarial case and doesn't hurt sorted queries such as tasks from nightly benchmarks, I'd probably be good with it.

gashutos added the type:enhancement label Apr 17, 2024

RS146BIJAY mentioned this issue Apr 26, 2024

[RFC] Context Aware Segments opensearch-project/OpenSearch#13183

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Improve skipping logic for after values in sort query #13313

[RFC] Improve skipping logic for after values in sort query #13313

gashutos commented Apr 17, 2024

gashutos commented Apr 17, 2024

jpountz commented Apr 17, 2024

jpountz commented Apr 17, 2024 •

edited

gashutos commented Apr 18, 2024 •

edited

gashutos commented Apr 18, 2024 •

edited

jpountz commented Apr 18, 2024

[RFC] Improve skipping logic for after values in sort query #13313

[RFC] Improve skipping logic for after values in sort query #13313

Comments

gashutos commented Apr 17, 2024

Description

Background

Problem

Proposed solution

gashutos commented Apr 17, 2024

jpountz commented Apr 17, 2024

jpountz commented Apr 17, 2024 • edited

gashutos commented Apr 18, 2024 • edited

gashutos commented Apr 18, 2024 • edited

jpountz commented Apr 18, 2024

jpountz commented Apr 17, 2024 •

edited

gashutos commented Apr 18, 2024 •

edited

gashutos commented Apr 18, 2024 •

edited