
feat: add a simple way to chain two tokenizers #2304

Open. Wants to merge 3 commits into base: main.

Conversation

ctron commented Jan 18, 2024:

No description provided.

ctron (Author) commented Jan 19, 2024:

I applied nightly rustfmt.

fulmicoton (Collaborator) commented:

@ctron can you describe your use case?

ctron (Author) commented Jan 19, 2024:

My use case is to emit all simple (word) tokens plus all of their ngrams for the same input.
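
For illustration, a minimal sketch of that use case: run tantivy's SimpleTokenizer and NgramTokenizer back to back over the same input, which is what a chained tokenizer would do in one pass. The constructor signatures below are assumed from recent tantivy versions and may differ:

use tantivy::tokenizer::{NgramTokenizer, SimpleTokenizer, TokenStream, Tokenizer};

fn print_chained_tokens(text: &str) {
    // First, the plain word tokens...
    let mut simple = SimpleTokenizer::default();
    let mut stream = simple.token_stream(text);
    while stream.advance() {
        println!("simple: {}", stream.token().text);
    }

    // ...then all 2- and 3-grams of the same input. In recent tantivy
    // versions NgramTokenizer::new validates its arguments and returns
    // a Result (an assumption here; adjust for your version).
    let mut ngrams = NgramTokenizer::new(2, 3, false).expect("valid ngram parameters");
    let mut stream = ngrams.token_stream(text);
    while stream.advance() {
        println!("ngram: {}", stream.token().text);
    }
}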

src/tokenizer/chain.rs: two inline review comments (outdated, resolved).
Inline review comment (Contributor) on src/tokenizer/chain.rs, at pub struct ChainTokenStream<'a, F, S>:
The Token::position fields would need updating though, wouldn't they? Meaning the positions of the second stream should be offset by the number of tokens yielded by the first one?

src/tokenizer/chain.rs: another inline review comment (outdated, resolved).
ctron (Author) commented Jan 22, 2024:

I was able to incorporate most of the feedback you mentioned. It's less explicit without the enum, but works the same way. There was just one call to second.advance() missing, as a TokenStream seems to start before the first item.

I am not sure about the position topic, but I also don't fully understand it: if the idea is to enumerate/count all tokens, then yes, the positions should be offset by the length of the first stream. If position reflects where the token sits in the original input, then offsetting would not be correct.

I'll admit that the whole API around tokenization feels a bit confusing.
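
For context on the two readings: in tantivy's Token, position is the index of the token in the stream, while offset_from and offset_to are byte offsets into the original input. A sketch of what SimpleTokenizer yields (the exact values are my reading of the API and worth double-checking):

use tantivy::tokenizer::{SimpleTokenizer, TokenStream, Tokenizer};

fn main() {
    let mut tokenizer = SimpleTokenizer::default();
    let mut stream = tokenizer.token_stream("hello world");

    stream.advance();
    assert_eq!(stream.token().text, "hello");
    assert_eq!(stream.token().position, 0);    // token index in the stream
    assert_eq!(stream.token().offset_from, 0); // byte offsets into the input
    assert_eq!(stream.token().offset_to, 5);

    stream.advance();
    assert_eq!(stream.token().text, "world");
    assert_eq!(stream.token().position, 1);
    assert_eq!(stream.token().offset_from, 6);
    assert_eq!(stream.token().offset_to, 11);
}

The open question is which of the two conventions the second half of a chained stream should follow.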

adamreichold (Contributor) commented:

> I was able to incorporate most of the feedback you mentioned. It's less explicit without the enum, but works the same way. There was just one call to second.advance() missing, as a TokenStream seems to start before the first item.

Agreed, it is more implicit. But I was mainly suggesting it for efficiency, i.e. keep the state as small as possible and drop the first tokenizer as soon as we are done with it. If you feel like code-golfing, I think those two calls to second.advance() could even be merged:

fn advance(&mut self) -> bool {
    // Drain the first stream while it still yields tokens; once it runs
    // dry, drop it so we keep as little state around as possible.
    if let Some(first) = &mut self.first {
        if first.advance() {
            return true;
        } else {
            self.first = None;
        }
    }

    // The first stream is gone (or just ran dry): fall through to the second.
    self.second.advance()
}

ctron (Author) commented Jan 22, 2024:

> If you feel like code-golfing, I think those two calls to second.advance() could even be merged

I like that, pushed.

So, the remaining thing seems to be the position. I am just not sure what to do with it.

adamreichold (Contributor) commented:

> So, the remaining thing seems to be the position. I am just not sure what to do with it.

I'd say let's wait for input from @fulmicoton on that. I am myself unsure what downstream consumers expect of the position field. I suspect it is mainly used for phrase queries with slop, which would make the current implementation correct, i.e. position would still refer to the original input.

fulmicoton (Collaborator) commented:

@PSeitz can you review?

ctron (Author) commented Jan 23, 2024:

Fixed the test issue.

PSeitz (Contributor) commented Feb 5, 2024:

There is currently a hidden contract in the tokenizer API which expects positions to be sorted in increasing order. This is relied on in the serialization code in recorder.rs, where positions are delta-encoded.
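
To illustrate why sortedness matters: delta encoding stores each position as the difference to its predecessor, which only works if positions never decrease. A hypothetical sketch (not the actual recorder.rs code):

// Positions from a single well-behaved stream are nondecreasing, so the
// deltas are small non-negative integers that compress well:
//
//   positions [0, 1, 4, 7]    ->  deltas [0, 1, 3, 3]
//
// A chained stream whose second half restarts at 0 breaks this:
//
//   positions [0, 1, 2, 0, 1] ->  deltas [0, 1, 1, -2, 1]
//
// and an unsigned delta encoding cannot represent the -2.
fn delta_encode(positions: &[u32]) -> Option<Vec<u32>> {
    let mut previous = 0u32;
    let mut deltas = Vec::with_capacity(positions.len());
    for &position in positions {
        // checked_sub fails exactly when the input is unsorted.
        deltas.push(position.checked_sub(previous)?);
        previous = position;
    }
    Some(deltas)
}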

There are two options:

  • Handle unsorted positions in the serialization (allow unsorted positions from the tokenizer)
  • Return the tokens in a sorted way

I think handling unsorted positions is not really favorable, since it would carry some performance overhead.

adamreichold (Contributor) commented:

> Return the tokens in a sorted way

So in this case, we'd need to interleave the output of the two tokenizers dynamically to ensure that one does not outpace the other. Or we could just make positions up and offset all positions returned by the second tokenizer?
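
The offsetting variant could look roughly like this (a sketch, not the PR's code; first_len is a hypothetical counter field, and it assumes both streams use the plain token index as position):

fn advance(&mut self) -> bool {
    if let Some(first) = &mut self.first {
        if first.advance() {
            // Count the tokens of the first stream so we know how far
            // to shift the second stream's positions later.
            self.first_len += 1;
            return true;
        } else {
            self.first = None;
        }
    }

    if self.second.advance() {
        // Shift the position past everything the first stream yielded,
        // so the combined stream still looks sorted to the delta encoder.
        self.second.token_mut().position += self.first_len;
        true
    } else {
        false
    }
}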
