
[BUG]: Citations not reflecting custom sized embedding #1259

Open
RahSwe opened this issue May 1, 2024 · 5 comments
Labels
enhancement New feature or request

Comments


RahSwe commented May 1, 2024

How are you running AnythingLLM?

Docker (remote machine)

What happened?

After starting a fresh instance of AnythingLLM and setting the custom chunk size to 800 tokens, the citations shown are much shorter than the configured chunk size. As an example:

[screenshot: a citation much shorter than the configured chunk size]

Are there known steps to reproduce?

See above

RahSwe added the possible bug (Bug was reported but is not confirmed or is unable to be replicated.) label on May 1, 2024

Propheticus commented May 11, 2024

It's unexpected, but the chunk length value is in characters, not tokens, so it's smaller by a factor of 3 to 4. (The same goes for the sequence length setting for the embedding model, by the way.)
800 tokens would be a setting of ~3,000 characters.
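A minimal sketch of that rule of thumb (the 3–4 characters per token ratio is a heuristic for English text and varies by tokenizer and language):

```ts
// Heuristic conversion from a token budget to a character-based chunk
// setting; ~4 chars/token is an approximation, not an exact property
// of any tokenizer.
const CHARS_PER_TOKEN = 4;

function tokensToChars(tokens: number): number {
  return tokens * CHARS_PER_TOKEN;
}

console.log(tokensToChars(800)); // ~3200 characters, close to the ~3,000 above
```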

timothycarambat (Member) commented

@Propheticus is correct, the split size is in characters, not tokens. This is actually intentional and known: counting tokens depends on the tokenizer of the model, which we cannot really replicate since we only have access to the tokenizer that comes with tiktoken, and the reason we don't rely on that is that it can be expensive to calculate for each embedding.

I do think, though, that we are at a point where we can probably rely on that library more, since we for sure "underestimate" the token length when counting by chars, which is indeed off by some factor depending on the tokenizer.
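For reference, a minimal sketch of exact token counting with the `js-tiktoken` package (assuming the `cl100k_base` encoding used by OpenAI's recent embedding models; this is an illustration, not AnythingLLM's actual code):

```ts
import { getEncoding } from "js-tiktoken";

// cl100k_base is the encoding used by OpenAI's recent embedding and chat
// models; other model families ship different tokenizers, which is why
// a character count is only ever an estimate.
const enc = getEncoding("cl100k_base");

function countTokens(text: string): number {
  return enc.encode(text).length;
}

const sample = "The quick brown fox jumps over the lazy dog.";
console.log(countTokens(sample), "tokens vs", sample.length, "characters");
```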

timothycarambat added the enhancement (New feature or request) label and removed the possible bug (Bug was reported but is not confirmed or is unable to be replicated.) label on May 12, 2024
RahSwe (Author) commented May 12, 2024

@Propheticus @timothycarambat

Mistake on my side, I did not realize the setting was in characters, not tokens, which means that I will now re-vectorize all of my embeddings.

However, it looks like the citations are still too small even when counting characters and not tokens?

RahSwe (Author) commented May 12, 2024

@Propheticus @timothycarambat

After further testing, it seems that something is off:

  • I increased the chunk size from 800 characters to 4,000 characters (using the OpenAI embedding model and LanceDB).
  • I then deleted all vectors using the workspace option.
  • I then created new vectors.
  • I then ran test queries. The citations are still only in the range of 700–900 characters (a quick way to check this is sketched below).
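A quick, hypothetical way to sanity-check which setting the citations reflect (the `citations` array is a placeholder for text pasted from the UI, not an AnythingLLM API):

```ts
// Hypothetical check: paste citation snippets copied from the UI into the
// array below, then count characters to see whether they line up with the
// old 800-character setting or the new 4,000-character one.
const citations: string[] = [
  // "...pasted citation text...",
];

for (const c of citations) {
  console.log(`${c.length} chars`);
}
```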

RahSwe (Author) commented May 12, 2024

@Propheticus @timothycarambat

Further testing:

  • I tried not only resetting the workspace vector database but also deleting all the files, re-uploading them, and then embedding again.
  • I now get chunks and citations that approximately correspond to the custom chunk size. However, they do not match the 4,000-character setting exactly (see the sketch below); as an example:
[screenshot: citation lengths close to, but under, 4,000 characters]
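A likely explanation for the near-but-not-exact sizes: recursive character splitters treat the configured chunk size as an upper bound and prefer to break at natural boundaries (paragraphs, sentences). A minimal sketch using LangChain's `RecursiveCharacterTextSplitter`, as an assumption about the splitting strategy rather than a confirmed trace of AnythingLLM's code:

```ts
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

async function showChunkSizes(text: string) {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 4000,  // upper bound in characters, not an exact target
    chunkOverlap: 20,
  });
  const chunks = await splitter.splitText(text);
  // Chunks end at paragraph/sentence boundaries where possible, so most
  // land somewhat under 4000 characters rather than exactly at it.
  for (const c of chunks) console.log(c.length);
}
```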
