
[BUG]: Citations not reflecting custom sized embedding #1259

Open
RahSwe opened this issue May 1, 2024 · 5 comments
Labels
enhancement New feature or request

Comments


RahSwe commented May 1, 2024

How are you running AnythingLLM?

Docker (remote machine)

What happened?

After starting a fresh instance of AnythingLLM and setting the custom chunk size to 800 tokens, the citations shown are much shorter than the configured chunk size. As an example:

[screenshot: a citation much shorter than the configured chunk size]

Are there known steps to reproduce?

See above

RahSwe added the possible bug (Bug was reported but is not confirmed or is unable to be replicated.) label on May 1, 2024

Propheticus commented May 11, 2024

It's unexpected, but the chunk length value is in characters, not tokens, so it's smaller by a factor of 3 to 4. (The same goes for the sequence length setting for the embedding model, by the way.)
800 tokens would be a setting of ~3,000 characters.
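A minimal sketch of that rule of thumb (the 3–4 characters per token ratio is a heuristic for English text and varies by tokenizer and language):

```ts
// Heuristic conversion from a token budget to a character-based chunk
// setting; ~4 chars/token is an approximation, not an exact property
// of any tokenizer.
const CHARS_PER_TOKEN = 4;

function tokensToChars(tokens: number): number {
  return tokens * CHARS_PER_TOKEN;
}

console.log(tokensToChars(800)); // ~3200 characters, close to the ~3,000 above
```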

timothycarambat (Member) commented

@Propheticus is correct, the split size is in characters, not tokens. This is actually intentional and known: counting tokens depends on the tokenizer of the model, which we cannot really replicate since we only have access to the tokenizer that comes with tiktoken, and the reason we don't rely on that is that it can be expensive to calculate for each embedding.

I do think, though, that we are at a point where we can probably rely on that library more, since we for sure "underestimate" the token length when counting by chars, which is indeed off by some factor depending on the tokenizer.
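For reference, a minimal sketch of exact token counting with the `js-tiktoken` package (assuming the `cl100k_base` encoding used by OpenAI's recent embedding models; this is an illustration, not AnythingLLM's actual code):

```ts
import { getEncoding } from "js-tiktoken";

// cl100k_base is the encoding used by OpenAI's recent embedding and chat
// models; other model families ship different tokenizers, which is why
// a character count is only ever an estimate.
const enc = getEncoding("cl100k_base");

function countTokens(text: string): number {
  return enc.encode(text).length;
}

const sample = "The quick brown fox jumps over the lazy dog.";
console.log(countTokens(sample), "tokens vs", sample.length, "characters");
```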

timothycarambat added the enhancement (New feature or request) label and removed the possible bug (Bug was reported but is not confirmed or is unable to be replicated.) label on May 12, 2024
RahSwe (Author) commented May 12, 2024

@Propheticus @timothycarambat

Mistake on my side, I did not realize the setting was in characters, not tokens, which means that I will now re-vectorize all of my embeddings.

However, it looks like the citations are still too small even when counting characters and not tokens?

RahSwe (Author) commented May 12, 2024

@Propheticus @timothycarambat

After further testing, it seems that something is off:

  • I increased the chunk size from 800 characters to 4,000 characters (using the OpenAI embedding model and LanceDB).
  • I then deleted all vectors using the workspace option.
  • I then created new vectors.
  • I then ran test queries. The citations are still only in the range of 700–900 characters (a quick way to check this is sketched below).
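A quick, hypothetical way to sanity-check which setting the citations reflect (the `citations` array is a placeholder for text pasted from the UI, not an AnythingLLM API):

```ts
// Hypothetical check: paste citation snippets copied from the UI into the
// array below, then count characters to see whether they line up with the
// old 800-character setting or the new 4,000-character one.
const citations: string[] = [
  // "...pasted citation text...",
];

for (const c of citations) {
  console.log(`${c.length} chars`);
}
```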

RahSwe (Author) commented May 12, 2024

@Propheticus @timothycarambat

Further testing:

  • I tried not only resetting the workspace vector database but also deleting all the files, re-uploading them, and then embedding again.
  • I now get chunks and citations that approximately correspond to the custom chunk size. However, they do not match the 4,000-character setting exactly (see the sketch below); as an example:
[screenshot: citation lengths close to, but under, 4,000 characters]
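A likely explanation for the near-but-not-exact sizes: recursive character splitters treat the configured chunk size as an upper bound and prefer to break at natural boundaries (paragraphs, sentences). A minimal sketch using LangChain's `RecursiveCharacterTextSplitter`, as an assumption about the splitting strategy rather than a confirmed trace of AnythingLLM's code:

```ts
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

async function showChunkSizes(text: string) {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 4000,  // upper bound in characters, not an exact target
    chunkOverlap: 20,
  });
  const chunks = await splitter.splitText(text);
  // Chunks end at paragraph/sentence boundaries where possible, so most
  // land somewhat under 4000 characters rather than exactly at it.
  for (const c of chunks) console.log(c.length);
}
```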
