Disjunction as CompetitiveIterator for numeric dynamic pruning #13221
Conversation
```diff
   */
-  private static long estimatePointCount(
+  public static long estimatePointCount(
```
I had to make this public again to get an accurate point count in `intersectThresholdValue`.
This is a great idea. So when we get a more competitive bottom value, we can simply subtract from our existing competitive iterator instead of recomputing this competitive iterator entirely. A benefit of this approach is that we no longer have to wait for the estimated point count to decrease by 8x before updating the competitive iterator, we can do it more often and save evaluating more docs.
A downside is that this approach may be less memory-efficient, since we store competitive docs as integers, never as a bit set like today. But we may be able to work around it by just starting dynamic pruning a bit later, when we have fewer docs to store.
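The "subtract instead of rebuild" idea described above can be sketched roughly as follows. This is a minimal, hypothetical illustration (the names `Clause`, `onBottomChanged`, etc. are invented here; the real Lucene classes are considerably more involved):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch: keep the competitive iterator as a disjunction of
// small clauses, each tagged with its most competitive value. When the
// bottom value improves, non-competitive clauses are simply dropped
// instead of rebuilding the whole iterator from the points tree.
class CompetitiveDisjunctionSketch {
  record Clause(long mostCompetitiveValue, int docCount) {}

  final List<Clause> clauses = new ArrayList<>();

  // For an ascending sort: a clause stays competitive only while its
  // most competitive (smallest) value is <= the current bottom value.
  void onBottomChanged(long bottomValue) {
    clauses.removeIf(c -> c.mostCompetitiveValue() > bottomValue);
  }

  int clauseCount() {
    return clauses.size();
  }
}
```

Because updating is just a removal pass over the clause list, it can be done on every bottom-value change rather than waiting for an 8x decrease in the estimated point count.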
It's a bit surprising to me that we are configuring the disjunction by the number of docs on each clause (`MIN_BLOCK_DISI_LENGTH`), as this could create disjunctions with many, many clauses (which is something I don't like about `TermOrdValComparator`). Maybe we could configure it via a target number of clauses instead, e.g. something like `min(128, leadCost / 512)`?
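The suggested sizing formula amounts to the following (a sketch of the suggestion only, not existing Lucene code; the method name is hypothetical):

```java
// Derive a clause budget from the lead cost rather than using a fixed
// per-clause doc count: roughly one clause per 512 lead-cost docs,
// capped at 128 clauses.
class ClauseBudgetSketch {
  static long targetClauseCount(long leadCost) {
    return Math.min(128, leadCost / 512);
  }
}
```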
I wonder how this approach compares with the current one in the worst-case scenario when documents get collected in competitive order, e.g. increasing order for a descending sort. I don't have a good intuition for whether it would perform better or worse, do you know?
+1 for memory concern. I'd limit the
More disjunction clauses means we can update the competitive iterator more often but also introduce more overhead of
In the worst case we cannot skip any docs in the first segment, so I think the overhead can be considered from the following perspectives:
In general, for the worst case, I think this patch may speed up medium-cardinality fields but slow down high-cardinality fields where the bit set kicks in. I can play with luceneutil to verify the guess.
I separately built two indexes:
wikimedium10m.lucene_baseline.Lucene99.dvfields.sort=dayOfYear:int.nd10M
wikimedium10m.lucene_baseline.Lucene99.dvfields.sort=lastMod:long.nd10M
I ran 10 rounds on the wikimediumall index (normal index, no index sorting / force merge). The result looks positive in general, but we do see a slight regression for the high-cardinality field in several rounds (3/10).
I left some questions, this code is a bit complex (the existing one, not only your change!) so it's taking me a bit of time to fully understand how it's working.
Review threads were opened on:
lucene/core/src/java/org/apache/lucene/search/comparators/DoubleComparator.java (outdated, resolved)
lucene/core/src/java/org/apache/lucene/search/comparators/NumericComparator.java (outdated, resolved)
lucene/core/src/java/org/apache/lucene/search/comparators/NumericComparator.java (resolved)
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!
This is a fascinating optimization! I left some comments/questions but this looks generally correct to me.
```java
  leadCost = Math.min(leadCost, competitiveIterator.cost());
}
long threshold = Math.min(leadCost >> 3, maxDoc >> 5);
thresholdAsLong = intersectThresholdValue(threshold);
```
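As a small numeric check of the formula in this snippet: the pruning threshold is capped at both 1/8 of the lead cost and 1/32 of `maxDoc`, whichever is smaller (the wrapper class here is only for illustration):

```java
// The threshold formula from the snippet above, isolated so the two
// caps (leadCost / 8 and maxDoc / 32) can be checked numerically.
class ThresholdSketch {
  static long threshold(long leadCost, int maxDoc) {
    return Math.min(leadCost >> 3, maxDoc >> 5);
  }
}
```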
What if `thresholdAsLong` was -1 here, would we then run the computation again later? Do we need to make `thresholdAsLong` an `Optional`, or at least a nullable `Long`?
Nice catch!
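The pitfall raised here is the classic lazy-initialization sentinel problem: with a -1 sentinel, "not yet computed" and "computed, and the answer happens to be -1" are indistinguishable, so the expensive intersection could be redone. A nullable `Long` keeps the two states apart, as in this sketch (names hypothetical):

```java
// Sketch of the reviewer's point: use null (not -1) to mean "not yet
// computed", so a legitimate -1 result is cached instead of recomputed.
class LazyThresholdSketch {
  private Long thresholdAsLong; // null == not computed yet
  int computations = 0;

  long threshold() {
    if (thresholdAsLong == null) {
      computations++;
      thresholdAsLong = -1L; // pretend the expensive computation returned -1
    }
    return thresholdAsLong;
  }
}
```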
```java
private void tryReduceDisjunctionClause(CompetitiveIterator iter) {
  int originalSize = iter.disis.size();
  Predicate<DisiAndMostCompetitiveValue> isCompetitive =
      d -> d.mostCompetitiveValue <= maxValueAsLong && d.mostCompetitiveValue >= minValueAsLong;
```
It's a bit awkward to define this predicate when it is only used once; could you inline it in the loop below?
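The suggested inlining would look roughly like this (a sketch only: plain `long` values stand in for the `DisiAndMostCompetitiveValue` entries, and the method name is invented):

```java
// Rather than declaring a Predicate object that is applied exactly
// once, test the competitiveness condition directly inside the loop.
class InlinePredicateSketch {
  static int countCompetitive(
      long[] mostCompetitiveValues, long minValueAsLong, long maxValueAsLong) {
    int kept = 0;
    for (long v : mostCompetitiveValues) {
      // condition inlined instead of a separate Predicate<...> field
      if (v >= minValueAsLong && v <= maxValueAsLong) {
        kept++;
      }
    }
    return kept;
  }
}
```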
```java
}
long mostCompetitiveValue =
    reverse == false
        ? Math.max(blockMinValue, minValueAsLong)
```
I'm not sure why we need to take the max with `minValueAsLong`?
The left-most leaf block will kick in `visitDocValues`, and only docs with values between `minValueAsLong` and `blockMaxValue` will be consumed.
So `blockMinValue` should already be greater than or equal to `minValueAsLong`, shouldn't it?
Hmmm, forgive me if I misunderstand something, but I think `blockMinValue` can be smaller than `minValueAsLong` when the leaf block has a `CELL_CROSSES_QUERY` relation with `minValueAsLong`?
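The clamp this thread settles on can be sketched as follows: a leaf block that crosses the query boundary has `blockMinValue < minValueAsLong`, yet only values inside the query range are consumed from it, so its most competitive value must be clamped to the range (hypothetical simplified signature, not the actual Lucene method):

```java
// Sketch of the clamping logic discussed above. For an ascending sort
// (reverse == false) the block's most competitive value is its minimum,
// clamped up to the query's lower bound; for a descending sort it is
// its maximum, clamped down to the query's upper bound.
class ClampSketch {
  static long mostCompetitiveValue(
      boolean reverse,
      long blockMinValue, long blockMaxValue,
      long minValueAsLong, long maxValueAsLong) {
    return reverse == false
        ? Math.max(blockMinValue, minValueAsLong)
        : Math.min(blockMaxValue, maxValueAsLong);
  }
}
```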
```java
private int minBlockLength() {
  // bottom value can be much more competitive than thresholdAsLong, recompute the cost.
  int cost =
      Math.toIntExact(
```
Can we actually safely cast to an int? The field may be multi-valued and have more than `Integer.MAX_VALUE` values.
FWIW I'd be ok with throwing an exception if there are more than `MAX_DISJUNCTION_CLAUSE * Integer.MAX_VALUE` values or something like that, so that `minBlockLength` can still be stored as an integer.
> can we actually safely cast to an int? the field may be multi-valued and have more than Integer.MAX_VALUE values?

Nice catch! I thought the count of points between `thresholdValueAsLong` and `topValue` was limited to `maxDoc >> 5` (link), so the cast was safe. But if very many points (more than `Integer.MAX_VALUE - maxDoc / 128`) have a value equal to `thresholdValueAsLong`, the cast will not be safe.

> FWIW I'd be ok with throwing an exception if there are more than MAX_DISJUNCTION_CLAUSE*Integer.MAX_VALUE values or something like that, so that minBlockLength can still be stored as an integer.
+1. BTW, I wonder if we should catch all exceptions when trying dynamic pruning and fall back to the normal logic?
This sounds too fragile to me, I'd rather check very specific conditions.
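On the cast itself: `Math.toIntExact` (already used in the snippet above) throws an `ArithmeticException` on overflow rather than silently truncating, which matches the "throw an exception instead" suggestion. A sketch, with a hypothetical helper name:

```java
// A plain (int) cast silently truncates a point count larger than
// Integer.MAX_VALUE; Math.toIntExact throws ArithmeticException instead.
class SafeCastSketch {
  static int pointCountAsInt(long pointCount) {
    return Math.toIntExact(pointCount); // throws on overflow
  }

  static boolean overflows(long pointCount) {
    try {
      pointCountAsInt(pointCount);
      return false;
    } catch (ArithmeticException e) {
      return true;
    }
  }
}
```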
This looks good to me, this is a fascinating approach.
FYI we'll need to add `// nightly-benchmarks-results-changed //` to the commit message to set expectations with nightly benchmarks that sorting tasks return a different hit count with this change. cc @mikemccand
I am so excited to see if this (nightly benchmarks auto-regolding) finally works :) It so rarely gets tested!
Looks like it worked! https://people.apache.org/~mikemccand/lucenebench/TermDayOfYearSort.html
This PR proposes a new way to do numeric dynamic pruning, with the following changes:

Instead of building and rebuilding the competitive iterator whenever the bottom value gets more competitive, this patch builds the competitive iterator as a disjunction of small DISIs. Each small DISI maintains its most competitive value and gets discarded when that value is no longer competitive, like what we did in `TermOrdValComparator`. This lets us intersect the tree only once and update the competitive iterator more frequently.

For simplification, I tweaked the byte comparisons into comparable long values, e.g. `maxValueAsBytes` -> `maxValueAsLong`.

Benchmark

Here is a result based on wikimedium10m (baseline contains #13199)
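The "bytes to comparable long" simplification mentioned in the description relies on an order-preserving encoding: numeric values are mapped to longs whose natural ordering matches the original values. For doubles this is the sortable-bits trick (the same idea as Lucene's `NumericUtils.doubleToSortableLong`); the sketch below is an illustrative re-derivation, not the exact Lucene source:

```java
// Map a double to a long such that comparing the longs with < agrees
// with comparing the doubles. Positive doubles already sort correctly
// by their raw bits; for negative doubles, all bits except the sign
// bit are flipped so that more-negative values map to smaller longs.
class SortableLongSketch {
  static long doubleToSortableLong(double value) {
    long bits = Double.doubleToLongBits(value);
    return bits ^ ((bits >> 63) & 0x7fffffffffffffffL);
  }
}
```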