
clarify splitting in documentation #42

Open
antonkulaga opened this issue Oct 23, 2023 · 7 comments

Comments

@antonkulaga

I am using embeddings to embed scientific papers. Usually I use langchain splitters to split a paper into multiple chunks. However, it is not clear to me whether fastembed will do the splitting for me or whether I have to split everything myself (in which case I would have to run the embedding tokenizer to count the tokens in each paragraph).

@NirantK
Collaborator

NirantK commented Oct 24, 2023

FastEmbed will not do the splitting for you. Our default embedding model expects inputs of at most 512 tokens, and note that these tokens are different from the OpenAI tokens!
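
For illustration, a minimal sketch of counting tokens the way the model's own tokenizer would. The model name `"BAAI/bge-small-en"` is an assumption here; check which default model your fastembed version actually ships:

```python
from tokenizers import Tokenizer

# Assumption: the default model is a BGE variant; substitute the
# tokenizer of whichever model you actually use.
tokenizer = Tokenizer.from_pretrained("BAAI/bge-small-en")

def token_count(text: str) -> int:
    # encode() returns an Encoding object; .ids is the list of token ids
    return len(tokenizer.encode(text).ids)

print(token_count("Scientific papers usually exceed a 512-token window."))
```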

@NirantK NirantK closed this as completed Oct 24, 2023
@antonkulaga
Author

> these tokens are different from the OpenAI tokens!

Of course, I have a custom splitter (in my case https://github.com/longevity-genie/indexpaper/blob/main/indexpaper/splitting.py#L119) that counts tokens with the selected HuggingFace tokenizer and splits accordingly. The problem is that with this approach I have to run the tokenizer twice, once for splitting and once inside the embedding model, so I do not save much time. If fastembed had a token-aware splitter built in, it would save a lot of computation.
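
A minimal sketch of the idea, assuming the tokenizer of fastembed's default model (the model name is an assumption): tokenize once, split on the token ids, and decode each window back to text, so the text is never re-tokenized just to measure chunk sizes.

```python
from tokenizers import Tokenizer

# Assumed tokenizer; replace with the one matching your embedding model.
tokenizer = Tokenizer.from_pretrained("BAAI/bge-small-en")
MAX_TOKENS = 512

def split_by_tokens(text: str) -> list[str]:
    ids = tokenizer.encode(text).ids
    # Decode each window of at most MAX_TOKENS ids back into text.
    # Note: decoding is lossy with respect to the original whitespace.
    return [
        tokenizer.decode(ids[i : i + MAX_TOKENS])
        for i in range(0, len(ids), MAX_TOKENS)
    ]
```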

@x4080

x4080 commented Dec 18, 2023

@antonkulaga Yes, I thought the same. Is it possible to use fastembed to split the texts?

@antonkulaga
Author

I think closing this was premature. You have to measure the number of tokens to split the text, and for that you need to run the tokenizer one more time. Since fastembed does not have proper splitting, I will have to use the much slower langchain implementation, which cancels out much of the speed benefit of fastembed.
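
For reference, the langchain approach mentioned above looks roughly like this; it is only a sketch, with the model name assumed:

```python
from transformers import AutoTokenizer
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Recursive splitter that measures chunk length in HF-tokenizer tokens.
# Re-tokenizing every candidate chunk is the slow part being discussed.
hf_tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en")
splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    hf_tokenizer, chunk_size=512, chunk_overlap=64
)

paper_text = "..."  # full text of a paper
chunks = splitter.split_text(paper_text)
```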

@x4080

x4080 commented Dec 19, 2023

@antonkulaga ok, thanks for the answer

@NirantK NirantK reopened this Jan 5, 2024
@NirantK
Collaborator

NirantK commented Jan 5, 2024

Work in progress here: we're adding a recursive splitter (modeled on Langchain's, but with no dependency on it) that splits based on tokens: #80

Would appreciate folks sharing any feedback!
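
For anyone curious what "recursive splitter based on tokens" means in practice, here is a hypothetical sketch of the general shape of the technique; this is not the code in #80, and the tokenizer name is an assumption:

```python
from tokenizers import Tokenizer

# Assumed tokenizer; the real implementation in #80 may differ.
tokenizer = Tokenizer.from_pretrained("BAAI/bge-small-en")
SEPARATORS = ["\n\n", "\n", ". ", " "]  # coarse to fine

def recursive_split(text: str, max_tokens: int = 512, depth: int = 0) -> list[str]:
    # If the text already fits, or we have run out of separators, stop.
    if len(tokenizer.encode(text).ids) <= max_tokens or depth >= len(SEPARATORS):
        return [text]
    chunks: list[str] = []
    for part in text.split(SEPARATORS[depth]):
        chunks.extend(recursive_split(part, max_tokens, depth + 1))
    # Simplified: a production splitter would also merge adjacent small
    # chunks back up toward max_tokens and restore the separators.
    return chunks
```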

@x4080
Copy link

x4080 commented Jan 5, 2024

@NirantK cool
