Implementing Block WAND optimization for more queries #2390

sujayakar commented May 9, 2024

Is your feature request related to a problem? Please describe.

My system issues only one particular query shape: a BooleanQuery of TermQuerys across should, must, and must not clauses, where only the should clauses contribute to scoring via BM25.
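
For concreteness, a hedged sketch of that query shape using tantivy's public query API (the fields and terms here are made up for illustration):

use tantivy::query::{BooleanQuery, Occur, Query, TermQuery};
use tantivy::schema::{Field, IndexRecordOption};
use tantivy::Term;

// Hypothetical fields; only the Should clauses feed BM25 scoring.
fn build_query(body: Field, status: Field) -> BooleanQuery {
    let clauses: Vec<(Occur, Box<dyn Query>)> = vec![
        (
            Occur::Should,
            Box::new(TermQuery::new(
                Term::from_field_text(body, "rust"),
                IndexRecordOption::WithFreqs,
            )),
        ),
        (
            Occur::Must,
            Box::new(TermQuery::new(
                Term::from_field_text(status, "published"),
                IndexRecordOption::Basic,
            )),
        ),
        (
            Occur::MustNot,
            Box::new(TermQuery::new(
                Term::from_field_text(status, "deleted"),
                IndexRecordOption::Basic,
            )),
        ),
    ];
    BooleanQuery::new(clauses)
}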

I noticed that a BooleanQuery of TermQuerys uses the Block WAND optimization within BooleanWeight::for_each_pruning when it has only should clauses, but it falls back to the generic for_each_pruning_scorer when there are any must or must not clauses.

Describe the solution you'd like

I have a prototype for our usage of Tantivy within Convex that is able to utilize Block WAND even with must or must not clauses, and I'd be curious whether you'd be interested in us upstreaming it!

The high-level idea is to move the for_each_pruning API from the weights to the scorers. First, we add an optional method (with a default no-op) to the Scorer trait:

pub trait Scorer: DocSet {
    /// Optional: past the current position, skip documents whose score is
    /// less than or equal to `threshold`. Defaults to a no-op.
    fn adjust_threshold(&mut self, _threshold: Score) {}
}

This method mutates the document set so that, past the current position, documents scoring less than or equal to threshold are filtered out. The caller must pass monotonically increasing thresholds over the lifetime of the scorer.
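
For a single term scorer with block max scores, the method could be as simple as recording the tightened bound for advance and seek to consult when skipping postings blocks. A minimal sketch, assuming the scorer keeps a threshold field (an assumption of this sketch, not the existing code):

impl Scorer for TermScorerWithMaxScore {
    fn adjust_threshold(&mut self, threshold: Score) {
        // Thresholds only tighten over the scorer's lifetime; advance()
        // and seek() can then skip whole postings blocks whose block-max
        // score is <= self.threshold.
        debug_assert!(threshold >= self.threshold);
        self.threshold = threshold;
    }
}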

Then, the protocol between the collector and the scorer moves from being callback-based (like for_each_pruning_scorer today) to alternating between advancing the document set, updating the collector, and updating the thresholds.

impl Collector for TopDocs {
    fn collect_segment(...) -> crate::Result<...> {
        ...
        let mut top_n = TopNComputer::new(heap_len);
        loop {
            let doc = scorer.doc();
            if doc == TERMINATED {
                break;
            }
            top_n.push(scorer.score(), doc);

            // Tighten the pruning bound as the top-N heap fills up.
            let threshold = top_n.threshold.unwrap_or(Score::MIN);
            scorer.adjust_threshold(threshold);

            // Advance to the next candidate document.
            scorer.advance();
        }
        ...
    }
}

Next, we implement a specialized WeakAnd scorer that takes most of the logic from today's block_wand.rs.

pub struct WeakAnd {
    scorers: Vec<TermScorerWithMaxScore>,
    threshold: Score,

    // Invariant: if `scorers` is nonempty, this holds the current match.
    current_match: Option<WeakAndMatch>,
}

struct WeakAndMatch {
    len: usize,
    doc: DocId,
    score: Score,
}

(Full source: https://gist.github.com/sujayakar/c742f07dd99089f4c529a242c3b5ac9c)
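
For readers who don't click through, here is a condensed sketch of the pivot-selection step that advance and seek would share; find_pivot and max_score are illustrative names, not necessarily the gist's exact code:

impl WeakAnd {
    // Classic Block WAND pivot selection: order scorers by their current
    // doc, then accumulate max-score upper bounds until they exceed the
    // threshold. The scorer that crosses it sits on the pivot candidate.
    fn find_pivot(&mut self) -> Option<DocId> {
        self.scorers.sort_by_key(|scorer| scorer.doc());
        let mut upper_bound = 0.0;
        for scorer in &self.scorers {
            upper_bound += scorer.max_score();
            if upper_bound > self.threshold {
                return Some(scorer.doc());
            }
        }
        None // No remaining document can beat the threshold.
    }
}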

The idea here is that we run the Block WAND pivot-candidate generation in both DocSet::advance and DocSet::seek. A layer above can therefore call WeakAnd::seek with some doc: DocId, and the scorer resumes where it left off, finding the next pivot at or after doc that can beat the current threshold.

Then, this approach composes well with query::intersection::Intersection. If we're computing the intersection of a WeakAnd with a single TermScorer (which we'll assume has a smaller size_hint), the intersection will (see the sketch after this list):

  1. Iterate to the next document on the TermScorer to get a candidate
  2. Seek the WeakAnd to this candidate, obtaining seek_doc. This step potentially skips over many DocIds in the union if the intersection is more selective.
  3. If seek_doc > candidate, advance the TermScorer to seek_doc and retry. Here, the WeakAnd's high threshold potentially lets us skip over many DocIds in the intersection.
  4. Otherwise, emit candidate.
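
A hedged sketch of that driver loop (intersection_advance is an illustrative free function, not tantivy's actual Intersection code):

fn intersection_advance(term: &mut TermScorer, weak_and: &mut WeakAnd) -> DocId {
    // Step 1: the more selective TermScorer proposes a candidate.
    let mut candidate = term.advance();
    while candidate != TERMINATED {
        // Step 2: seek the WeakAnd; a high threshold lets it jump far ahead.
        let seek_doc = weak_and.seek(candidate);
        if seek_doc == candidate {
            // Step 4: both sides agree on this document.
            return candidate;
        }
        // Step 3: the WeakAnd overshot; catch the TermScorer up and retry.
        candidate = term.seek(seek_doc);
    }
    TERMINATED
}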

A similar approach works for Exclude, which we use for the must not clauses.
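
And a similar rough sketch for the must not side, assuming the excluded clauses are unioned into a single DocSet:

fn exclude_advance(weak_and: &mut WeakAnd, excluded: &mut impl DocSet) -> DocId {
    loop {
        let candidate = weak_and.advance();
        // Emit candidates absent from the excluded set. The threshold logic
        // stays inside the WeakAnd, so Block WAND pruning is preserved.
        if candidate == TERMINATED || excluded.seek(candidate) != candidate {
            return candidate;
        }
    }
}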

I haven't tested this approach rigorously yet, and I also haven't evaluated how it performs compared to the current callback-based API. But if you're interested in this work, I can clean it up and start evaluating it. I'd also be curious how the adjust_threshold approach interacts with other score combiners.
