Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multi range traversal for numeric range aggregations #13335

Open
jainankitk opened this issue May 1, 2024 · 4 comments
Open

Multi range traversal for numeric range aggregations #13335

jainankitk opened this issue May 1, 2024 · 4 comments

Comments

@jainankitk
Copy link

Description

In OpenSearch, we have introduced multi range traversal for collecting matching document count in single tree traversal (opensearch-project/OpenSearch#13317). That has helped improve the performance of numeric aggregations in OpenSearch significantly. I am wondering if there are other use cases that can benefit from this and change should be included in Lucene.

Constraints:

  1. The ranges are non-overlapping and in increasing order. For example - In (a1,b1),(a2,b2)...(an,bn), it is assumed ai<aj for all i<j and bi < a(i+1) for every i
  2. Field is 1 dimensional guaranteeing the points are stored in increasing order, and segment does not have any deletions
@jainankitk
Copy link
Author

Looks related to #9814, but there are differences between the two. As discussed offline with @bowenlan and @rishabhmaurya, MultiRangeQuery only tells what matches with your multi ranges, not what matches with each of your multi ranges.

@jainankitk
Copy link
Author

@jpountz - Thoughts?

@mikemccand
Copy link
Member

I like this optimization. Maybe it best fits in Lucene's facet module? But, I don't think our facet impls today ever use points, directly, to do counting/aggregation -- it's a two step process of first collecting into a bitset holding the matched docs, and, second, iterating those docs and looking doc values or facet ords (also from doc values) and counting/aggregating from there. But in the browse-only cases where a query just wants counts of ranges across all docs in the index, this opto should be a crazy fast way to achieve it when there are no deletions. Even when there are deletions, this opto could visit all docs and check the live docs and count/aggregate accordingly? The time is no longer sub-linear, but it'd still be faster than the two phased approach that Lucene's facets use today?

@stefanvodita / @Shradha26 WDYT?

@stefanvodita
Copy link
Contributor

It's true that we have this two-step process for aggregations (incl. counts) and that it's not always the most efficient solution.
+1 to try out this optimisation, sounds promising!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants