
clarify splitting in documentation #42

Open
antonkulaga opened this issue Oct 23, 2023 · 7 comments

Comments

@antonkulaga

I am using embeddings to embed scientific papers. Usually I use langchain splitters to split a paper into multiple chunks. However, it is not clear to me whether fastembed will do the splitting for me or whether I have to split everything myself (in which case I would have to run the embedding tokenizer to count the tokens in each paragraph).

@NirantK
Collaborator

NirantK commented Oct 24, 2023

FastEmbed will not do the splitting for you. Our default embedding model expects inputs of at most 512 tokens, and note that these tokens are different from the OpenAI tokens!
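
For illustration, a minimal sketch of counting tokens the way the model's own tokenizer would. The model name `"BAAI/bge-small-en"` is an assumption here; check which default model your fastembed version actually ships:

```python
from tokenizers import Tokenizer

# Assumption: the default model is a BGE variant; substitute the
# tokenizer of whichever model you actually use.
tokenizer = Tokenizer.from_pretrained("BAAI/bge-small-en")

def token_count(text: str) -> int:
    # encode() returns an Encoding object; .ids is the list of token ids
    return len(tokenizer.encode(text).ids)

print(token_count("Scientific papers usually exceed a 512-token window."))
```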

@NirantK NirantK closed this as completed Oct 24, 2023
@antonkulaga
Author

> these tokens are different from the OpenAI tokens!

Of course, I have a custom splitter (in my case https://github.com/longevity-genie/indexpaper/blob/main/indexpaper/splitting.py#L119) that counts tokens with the selected HuggingFace tokenizer and splits accordingly. The problem is that with this approach I have to run the tokenizer twice, once for splitting and once inside the embedding model, so I do not save much time. If fastembed had a token-aware splitter built in, it would save a lot of computation.
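
A minimal sketch of the idea, assuming the tokenizer of fastembed's default model (the model name is an assumption): tokenize once, split on the token ids, and decode each window back to text, so the text is never re-tokenized just to measure chunk sizes.

```python
from tokenizers import Tokenizer

# Assumed tokenizer; replace with the one matching your embedding model.
tokenizer = Tokenizer.from_pretrained("BAAI/bge-small-en")
MAX_TOKENS = 512

def split_by_tokens(text: str) -> list[str]:
    ids = tokenizer.encode(text).ids
    # Decode each window of at most MAX_TOKENS ids back into text.
    # Note: decoding is lossy with respect to the original whitespace.
    return [
        tokenizer.decode(ids[i : i + MAX_TOKENS])
        for i in range(0, len(ids), MAX_TOKENS)
    ]
```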

@x4080

x4080 commented Dec 18, 2023

@antonkulaga Yes, I thought the same. Is it possible to use fastembed to split the texts?

@antonkulaga
Author

I think closing this was premature. You have to measure the number of tokens to split the text, and for that you need to run the tokenizer one more time. Since fastembed does not have proper splitting, I will have to use the much slower langchain implementation, which cancels out much of the speed benefit of fastembed.
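
For reference, the langchain approach mentioned above looks roughly like this; it is only a sketch, with the model name assumed:

```python
from transformers import AutoTokenizer
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Recursive splitter that measures chunk length in HF-tokenizer tokens.
# Re-tokenizing every candidate chunk is the slow part being discussed.
hf_tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en")
splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    hf_tokenizer, chunk_size=512, chunk_overlap=64
)

paper_text = "..."  # full text of a paper
chunks = splitter.split_text(paper_text)
```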

@x4080

x4080 commented Dec 19, 2023

@antonkulaga ok, thanks for the answer

@NirantK NirantK reopened this Jan 5, 2024
@NirantK
Collaborator

NirantK commented Jan 5, 2024

Work in progress here: we're adding a recursive splitter (modeled on Langchain's, but with no dependency on it) that splits based on tokens: #80

Would appreciate folks sharing any feedback!
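
For anyone curious what "recursive splitter based on tokens" means in practice, here is a hypothetical sketch of the general shape of the technique; this is not the code in #80, and the tokenizer name is an assumption:

```python
from tokenizers import Tokenizer

# Assumed tokenizer; the real implementation in #80 may differ.
tokenizer = Tokenizer.from_pretrained("BAAI/bge-small-en")
SEPARATORS = ["\n\n", "\n", ". ", " "]  # coarse to fine

def recursive_split(text: str, max_tokens: int = 512, depth: int = 0) -> list[str]:
    # If the text already fits, or we have run out of separators, stop.
    if len(tokenizer.encode(text).ids) <= max_tokens or depth >= len(SEPARATORS):
        return [text]
    chunks: list[str] = []
    for part in text.split(SEPARATORS[depth]):
        chunks.extend(recursive_split(part, max_tokens, depth + 1))
    # Simplified: a production splitter would also merge adjacent small
    # chunks back up toward max_tokens and restore the separators.
    return chunks
```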

@x4080
Copy link

x4080 commented Jan 5, 2024

@NirantK cool
