
[Bug]: Token Counter is not working when transformations are applied with an embedding model. #13146

Open

dsanr opened this issue Apr 28, 2024 · 3 comments

Labels: bug (Something isn't working), triage (Issue needs to be triaged/prioritized)

Comments

dsanr commented Apr 28, 2024

Bug Description

The Token Counter does not work when an embedding model is included in the applied transformations. Consequently, even the Ingestion Pipeline is not usable if we want to use the Token Counter.

Version

0.10.33

Steps to Reproduce

```python
import tiktoken
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

documents = SimpleDirectoryReader(input_dir="./data", filename_as_id=True).load_data(show_progress=True)
token_counter = TokenCountingHandler(tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode, verbose=True)
Settings.callback_manager = CallbackManager([token_counter])
transformations = [SentenceSplitter(chunk_size=512, chunk_overlap=128), OpenAIEmbedding()]
index = VectorStoreIndex.from_documents(documents, transformations=transformations)
```

`token_counter.total_embedding_token_count` returns zero. If I remove `OpenAIEmbedding()` from `transformations` and pass it to `VectorStoreIndex.from_documents()` instead, the token counter works.

Relevant Logs/Tracebacks

No response

@dsanr dsanr added bug Something isn't working triage Issue needs to be triaged/prioritized labels Apr 28, 2024

dosubot bot commented Apr 28, 2024

It seems the issue with `total_embedding_token_count` returning zero when transformations include an `OpenAIEmbedding` model stems from how embedding events and their tokens are handled. If the embedding transformation does not populate `EventPayload.CHUNKS` as expected, or if the `TokenCountingHandler` never receives those chunks, the counter will stay at zero.

To address this, ensure the payload for embedding events includes `EventPayload.CHUNKS` with the correct data. Confirm that embedding events are properly identified and processed in the `on_event_end` method, and that a token count is calculated for each chunk. Additionally, verify that transformations applied before token counting do not alter the data in a way that breaks the counting logic.

Adding logging or debugging statements inside the handling of embedding events can help identify where the process is failing.
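To make the flow above concrete, here is a simplified, self-contained mock of the counting step (this is illustrative only, not the real llama-index implementation; `MockTokenCounter`, the `"embedding"` event name, and the plain-dict payload are stand-ins): the handler's `on_event_end` receives an embedding event whose payload carries the embedded chunks, and it sums tokenizer output over them. If that event never reaches the handler's callback manager, the method is never called and the total stays at zero, which matches the behavior reported in this issue.

```python
def simple_tokenizer(text):
    # Stand-in for tiktoken's .encode: whitespace tokens are enough for illustration.
    return text.split()

class MockTokenCounter:
    """Illustrative mock of a token-counting callback handler."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.total_embedding_token_count = 0

    def on_event_end(self, event_type, payload):
        # Count tokens only for embedding events that carry chunks.
        # If this method is never invoked (event routed elsewhere),
        # the total remains zero.
        if event_type == "embedding" and "chunks" in payload:
            for chunk in payload["chunks"]:
                self.total_embedding_token_count += len(self.tokenizer(chunk))

counter = MockTokenCounter(simple_tokenizer)
counter.on_event_end("embedding", {"chunks": ["hello world", "foo bar baz"]})
print(counter.total_embedding_token_count)  # 5
```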

About Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.


@logan-markewich (Collaborator) commented

@dsanr you should attach the callback manager to the embedding model (and also to the LLM):

```python
OpenAIEmbedding(..., callback_manager=callback_manager)
```
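The reason attaching matters can be sketched with a tiny mock (illustrative only, not the llama-index API; the `CallbackManager`, `Embedding`, and `get_text_embedding_batch` names here are stand-ins): a component captures a callback manager at construction time, and one built without the shared manager falls back to a private one, so a handler registered only on the shared manager never sees that component's events.

```python
class CallbackManager:
    """Mock manager that just records the events routed through it."""

    def __init__(self):
        self.events = []

    def on_event(self, name, payload):
        self.events.append((name, payload))

class Embedding:
    """Mock embedding component that reports through whichever manager it holds."""

    def __init__(self, callback_manager=None):
        # Without an explicit manager, the component uses its own private one,
        # so globally registered handlers never see its events.
        self.callback_manager = callback_manager or CallbackManager()

    def get_text_embedding_batch(self, chunks):
        self.callback_manager.on_event("embedding", {"chunks": chunks})
        return [[0.0] for _ in chunks]

shared = CallbackManager()
detached = Embedding()                        # private manager: invisible to `shared`
attached = Embedding(callback_manager=shared) # reports through `shared`

detached.get_text_embedding_batch(["a", "b"])
attached.get_text_embedding_batch(["a", "b"])

print(len(shared.events))  # 1 -- only the explicitly attached component reported
```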

dsanr (Author) commented Apr 29, 2024

@logan-markewich Shouldn't using `Settings.callback_manager` work?
The token counting documentation also does it this way: https://docs.llamaindex.ai/en/stable/examples/callbacks/TokenCountingHandler/
