
Disjunction as CompetitiveIterator for numeric dynamic pruning #13221

Merged · 15 commits · May 20, 2024

Conversation

@gf2121 gf2121 commented Mar 26, 2024

This PR proposes a new way to do numeric dynamic pruning, with the following changes:

  • Instead of complex sampling and estimating point counts to judge whether to build the competitive iterator, this patch proposes to find the threshold value. That is, we find the value that is 'N docs away from' the top value, taking advantage of the fact that the top value should be final in LeafComparators, as the following picture shows.

[image: illustration of the threshold value 'N docs away from' the top value]

  • Instead of building and rebuilding the competitive iterator when the bottom value gets more competitive, this patch proposes to build the competitive iterator as a disjunction of small DISIs. Each small DISI maintains its most competitive value and gets discarded when that value is no longer competitive, like what we did in TermOrdValComparator. This helps us intersect the tree only once and update the competitive iterator more frequently.

  • For simplification, I tweaked the bytes into comparable long values, e.g. maxValueAsBytes -> maxValueAsLong.
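A toy sketch of the disjunction idea described above (illustrative only; `ToyCompetitiveDisjunction` and `Clause` are made-up names, not Lucene's actual classes): each clause remembers the most competitive value among its docs, and when the bottom of the queue improves, whole clauses are discarded instead of rebuilding the iterator.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model: the competitive iterator is a disjunction of small clauses,
// each tracking the most competitive value among its docs.
class ToyCompetitiveDisjunction {
  static class Clause {
    final int[] docs;                // sorted doc IDs in this block
    final long mostCompetitiveValue; // best value any doc in the block holds
    Clause(int[] docs, long mostCompetitiveValue) {
      this.docs = docs;
      this.mostCompetitiveValue = mostCompetitiveValue;
    }
  }

  final List<Clause> clauses = new ArrayList<>();

  // For an ascending sort, a clause stays only if its best value can still
  // beat the current bottom; otherwise the whole clause is dropped.
  void updateBottom(long bottomValue) {
    clauses.removeIf(c -> c.mostCompetitiveValue >= bottomValue);
  }

  long cost() {
    long cost = 0;
    for (Clause c : clauses) cost += c.docs.length;
    return cost;
  }
}
```

Because clauses only ever get removed, the BKD tree is intersected once up front, and tightening the bottom value is a cheap list-filtering operation.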

Benchmark

Here is a result based on wikimedium10m (the baseline contains #13199):

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                      TermDTSort      176.84      (5.0%)      303.31      (6.3%)   71.5% (  57% -   87%) 0.000
           HighTermDayOfYearSort      454.72      (3.2%)      791.09      (7.9%)   74.0% (  60% -   87%) 0.000

- private static long estimatePointCount(
+ public static long estimatePointCount(
gf2121 (author):
I had to make this public again to get an accurate point count size in intersectThresholdValue.

@jpountz jpountz left a comment

This is a great idea. So when we get a more competitive bottom value, we can simply subtract from our existing competitive iterator instead of recomputing this competitive iterator entirely. A benefit of this approach is that we no longer have to wait for the estimated point count to decrease by 8x before updating the competitive iterator, we can do it more often and save evaluating more docs.

A downside is that this approach may be less memory-efficient, since we store competitive docs as integers, never as a bit set like today. But we may be able to work around it by just starting dynamic pruning a bit later, when we have fewer docs to store.

It's a bit surprising to me that we are configuring the disjunction by the number of docs on each clause (MIN_BLOCK_DISI_LENGTH) as this could create disjunctions with many many clauses (which is something I don't like about TermOrdValComparator). Maybe we could configure it via a target number of clauses instead, e.g. something like min(128, leadCost / 512)?

I wonder how this approach compares with the current one in the worst-case scenario when documents get collected in competitive order, e.g. increasing order for a descending sort. I don't have a good intuition for whether it would perform better or worse, do you know?
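The clause-count suggestion above could be sketched as follows (a sketch under the reviewer's numbers; `targetClauseCount` and `blockLength` are hypothetical names, not Lucene's API):

```java
class ClauseSizing {
  // Target clause count per the reviewer's sketch: min(128, leadCost / 512),
  // floored at 1 so there is always at least one clause.
  static int targetClauseCount(long leadCost) {
    return (int) Math.min(128L, Math.max(1L, leadCost / 512));
  }

  // The per-clause block length then follows from how many docs we store,
  // so the disjunction never explodes into thousands of clauses.
  static int blockLength(long docsToStore, long leadCost) {
    return (int) Math.max(1L, docsToStore / targetClauseCount(leadCost));
  }
}
```

Capping the clause count (rather than the per-clause doc count) trades slightly coarser pruning granularity for a bounded `advance` cost on the disjunction.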

@gf2121 commented Mar 27, 2024

> A downside is that this approach may be less memory-efficient, since we store competitive docs as integers, never as a bit set like today. But we may be able to work around it by just starting dynamic pruning a bit later, when we have fewer docs to store.

+1 for the memory concern. I'd limit the doc count threshold to maxDoc >> 5 so that it won't allocate more than maxDoc bits of memory.
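A quick arithmetic check of that bound (the `maxStoredBits` helper is hypothetical, just to show the accounting): storing at most maxDoc >> 5 doc IDs as 32-bit ints uses at most maxDoc bits, no more than a full bit set over the segment would.

```java
class MemoryBound {
  // Each stored doc ID is a 32-bit int, so (maxDoc / 32) ints occupy
  // at most maxDoc bits of memory.
  static long maxStoredBits(int maxDoc) {
    long maxStoredDocs = maxDoc >> 5;     // the cap from the comment above
    return maxStoredDocs * Integer.SIZE;  // 32 bits per stored doc ID
  }
}
```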

> It's a bit surprising to me that we are configuring the disjunction by the number of docs on each clause (MIN_BLOCK_DISI_LENGTH) as this could create disjunctions with many many clauses (which is something I don't like about TermOrdValComparator). Maybe we could configure it via a target number of clauses instead, e.g. something like min(128, leadCost / 512)?

More disjunction clauses mean we can update the competitive iterator more often, but they also introduce more overhead in CompetitiveIterator#advance. I like the idea of configuring by the number of disjunction clauses; will do.

> I wonder how this approach compares with the current one in the worst-case scenario when documents get collected in competitive order, e.g. increasing order for a descending sort. I don't have a good intuition for whether it would perform better or worse, do you know?

In the worst case we cannot skip any docs in the first segment, so I think the overhead can be considered from the following perspectives:

  • Update-bottom overhead: the current logic needs to estimate the point count over several rounds, while this patch computes the threshold value once. This patch wins.

  • CompetitiveIterator iteration overhead: the current CompetitiveIterator builds an array/bitset iterator, so it has a cheap advance, while this patch builds a disjunction of small arrays. Main wins.

  • CompetitiveIterator building overhead:

    • For low/medium-cardinality fields, this patch can win because we are building small DISIs, so we can skip the sort for a leaf block when its docs are already in order.
    • For high-cardinality fields, if the current logic kicks in the bitset case, main wins because building a bitset can be cheaper than sorting.
    • For high-cardinality fields, if the current logic does not kick in the bitset case, they are equal since we are using an O(n) LSBRadixSorter to sort these docs.

In general, for the worst case, I think this patch may speed up medium-cardinality fields but slow down high-cardinality fields that kick in the bitset case. I can play with luceneutil to verify this guess.
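The O(n) sort referred to above is a least-significant-byte-first radix sort; a minimal sketch of the technique over non-negative ints (simplified, not Lucene's LSBRadixSorter):

```java
class LsbRadixSort {
  // Sort non-negative ints with four stable counting-sort passes,
  // one per byte, from least to most significant.
  static void sort(int[] a) {
    int[] buf = new int[a.length];
    for (int shift = 0; shift < 32; shift += 8) {
      int[] counts = new int[257];
      // histogram of this byte, offset by one so prefix sums give starts
      for (int v : a) counts[((v >>> shift) & 0xFF) + 1]++;
      for (int i = 1; i < counts.length; i++) counts[i] += counts[i - 1];
      // stable scatter into the buffer, then copy back for the next pass
      for (int v : a) buf[counts[(v >>> shift) & 0xFF]++] = v;
      System.arraycopy(buf, 0, a, 0, a.length);
    }
  }
}
```

Each pass is a linear scan, so total work is O(n) with a small constant, which is why sorting doc IDs this way can compete with building a bitset.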

@gf2121 commented Mar 27, 2024

I separately built two wikimedium10m indices, force-merged and reverse-sorted by dayOfYear/lastMod; the results follow.

wikimedium10m.lucene_baseline.Lucene99.dvfields.sort=dayOfYear:int.nd10M

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
           HighTermDayOfYearSort      493.06      (1.3%)     1252.27      (4.5%)  154.0% ( 146% -  162%) 0.000

wikimedium10m.lucene_baseline.Lucene99.dvfields.sort=lastMod:long.nd10M

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                      TermDTSort      369.84      (3.1%)      357.03      (3.7%)   -3.5% (  -9% -    3%) 0.001

@gf2121 commented Mar 27, 2024

I ran 10 rounds on a wikimediumall index (normal index, no index sorting / force merge). Results look positive in general, but we do see a slight regression for the high-cardinality field in several rounds (3/10).

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                      TermDTSort      377.08      (4.3%)      364.56      (5.6%)   -3.3% ( -12% -    6%) 0.036
           HighTermDayOfYearSort      620.13      (4.1%)      783.98      (7.8%)   26.4% (  13% -   39%) 0.000
           
                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                      TermDTSort      334.03      (8.0%)      357.86      (5.5%)    7.1% (  -5% -   22%) 0.001
           HighTermDayOfYearSort      565.44      (6.3%)     1043.55      (6.3%)   84.6% (  67% -  103%) 0.000
           
                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                      TermDTSort      434.01      (4.1%)      491.14      (4.3%)   13.2% (   4% -   22%) 0.000
           HighTermDayOfYearSort      606.48      (3.6%)      914.42      (3.3%)   50.8% (  42% -   59%) 0.000
           
                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                      TermDTSort      467.90      (5.1%)      498.69      (4.5%)    6.6% (  -2% -   17%) 0.000
           HighTermDayOfYearSort      784.73      (5.2%)     1015.12      (2.6%)   29.4% (  20% -   39%) 0.000
           
                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                      TermDTSort      347.23      (4.3%)      319.38      (1.9%)   -8.0% ( -13% -   -1%) 0.000
           HighTermDayOfYearSort      703.03      (3.5%)      962.18      (1.9%)   36.9% (  30% -   43%) 0.000
           
                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                      TermDTSort      348.83      (6.9%)      380.95      (5.0%)    9.2% (  -2% -   22%) 0.000
           HighTermDayOfYearSort      474.03      (5.4%)     1071.19      (6.0%)  126.0% ( 108% -  145%) 0.000
           
                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                      TermDTSort      364.28      (7.6%)      350.49      (7.5%)   -3.8% ( -17% -   12%) 0.111
           HighTermDayOfYearSort      761.71      (5.8%)     1090.74      (9.0%)   43.2% (  26% -   61%) 0.000
           
                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                      TermDTSort      351.76      (8.7%)      381.25     (10.4%)    8.4% (  -9% -   30%) 0.006
           HighTermDayOfYearSort      599.32      (6.5%)      791.83      (9.2%)   32.1% (  15% -   51%) 0.000
           
                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                      TermDTSort      313.74      (4.3%)      319.89      (4.7%)    2.0% (  -6% -   11%) 0.168
           HighTermDayOfYearSort      642.69      (5.6%)      836.24      (5.0%)   30.1% (  18% -   43%) 0.000
           
                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                      TermDTSort      421.29      (4.3%)      498.57      (3.2%)   18.3% (  10% -   27%) 0.000
           HighTermDayOfYearSort      544.96      (3.1%)      865.97      (2.2%)   58.9% (  51% -   66%) 0.000

@jpountz jpountz left a comment

I left some questions; this code is a bit complex (the existing one, not only your change!), so it's taking me a bit of time to fully understand how it works.

@gf2121 gf2121 requested a review from jpountz April 16, 2024 11:47

github-actions bot commented May 1, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label May 1, 2024
@jpountz jpountz left a comment

This is a fascinating optimization! I left some comments/questions but this looks generally correct to me.

leadCost = Math.min(leadCost, competitiveIterator.cost());
}
long threshold = Math.min(leadCost >> 3, maxDoc >> 5);
thresholdAsLong = intersectThresholdValue(threshold);
jpountz:
What if thresholdAsLong was -1 here, would we then run the computation again later? Do we need to make thresholdAsLong an Optional, or at least a nullable Long?

gf2121 (author):
Nice catch!

private void tryReduceDisjunctionClause(CompetitiveIterator iter) {
int originalSize = iter.disis.size();
Predicate<DisiAndMostCompetitiveValue> isCompetitive =
d -> d.mostCompetitiveValue <= maxValueAsLong && d.mostCompetitiveValue >= minValueAsLong;
jpountz:
it's a bit awkward to define this predicate which is then only used once, could you inline it in the below loop?

}
long mostCompetitiveValue =
reverse == false
? Math.max(blockMinValue, minValueAsLong)
jpountz:
I'm not sure why we need to take the max with minValueAsLong?

gf2121 (author):
The most left leaf block will kick in visitDocValues and only docs with values between minValueAsLong and blockMaxValue will be consumed.

jpountz:
So blockMinValue should already be greater than or equal to minValueAsLong, shouldn't it?

gf2121 (author):
Hmmm... forgive me if I misunderstand something, but I think blockMinValue can be smaller than minValueAsLong when the leaf block has a CELL_CROSSES_QUERY relation with minValueAsLong?

private int minBlockLength() {
// bottom value can be much more competitive than thresholdAsLong, recompute the cost.
int cost =
Math.toIntExact(
jpountz:
can we actually safely cast to an int? the field may be multi-valued and have more than Integer.MAX_VALUE values?

FWIW I'd be ok with throwing an exception if there are more than MAX_DISJUNCTION_CLAUSE*Integer.MAX_VALUE values or something like that, so that minBlockLength can still be stored as an integer.

gf2121 (author):
> can we actually safely cast to an int? the field may be multi-valued and have more than Integer.MAX_VALUE values?

Nice catch! I thought the count of points between thresholdValueAsLong and topValue was limited to maxDoc >> 5 (link), so the cast would be safe. But if very many points (more than Integer.MAX_VALUE - maxDoc / 128) have a value equal to thresholdValueAsLong, the cast is not safe.

> FWIW I'd be ok with throwing an exception if there are more than MAX_DISJUNCTION_CLAUSE*Integer.MAX_VALUE values or something like that, so that minBlockLength can still be stored as an integer.

+1. BTW, I wonder if we should catch all exceptions when trying dynamic pruning and fall back to the normal logic?

jpountz:
This sounds too fragile to me, I'd rather check very specific conditions.
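One such specific condition is checked by `Math.toIntExact`, which throws `ArithmeticException` on overflow rather than silently truncating like a plain `(int)` cast; a small sketch (the `safeCost` helper is a hypothetical name):

```java
class CastCheck {
  // Throws ArithmeticException if pointCount does not fit in an int,
  // instead of silently wrapping like (int) pointCount would.
  static int safeCost(long pointCount) {
    return Math.toIntExact(pointCount);
  }
}
```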

@github-actions github-actions bot removed the Stale label May 15, 2024
@jpountz jpountz left a comment

This looks good to me, this is a fascinating approach.

FYI we'll need to add // nightly-benchmarks-results-changed // to the commit message to set expectations with nightly benchmarks that sorting tasks return a different hit count with this change. cc @mikemccand

@mikemccand (Member) commented:

> FYI we'll need to add // nightly-benchmarks-results-changed // to the commit message to set expectations with nightly benchmarks that sorting tasks return a different hit count with this change. cc @mikemccand

I am so excited to see if this (nightly benchmarks auto-regolding) finally works :) It so rarely gets tested!

@gf2121 gf2121 merged commit 1ee4f8a into apache:main May 20, 2024
3 checks passed
@jpountz commented May 21, 2024

> I am so excited to see if this (nightly benchmarks auto-regolding) finally works

Looks like it worked! https://people.apache.org/~mikemccand/lucenebench/TermDayOfYearSort.html
