[Question]: max_token auto adjust for completion requests but not chat requests #8437
Replies: 2 comments
-
🤖 Hello, You're correct in your observation that the LlamaIndex framework automatically adjusts the max token limit for completion requests but not for chat requests with GPT-3.5 Turbo. This is due to the different ways these two types of requests handle tokens. In the provided code, the context window is reduced for chat models:

```python
model_name = llm_metadata.model_name
if is_chat_model(model_name):
    context_window -= 150
```

This adjustment is not made for completion requests because they handle tokens differently. In a chat model, each message in the conversation consumes extra tokens for its role and formatting, including system, user, and assistant messages. This is why the context window is reduced by 150 tokens for chat models. In contrast, completion requests do not have this additional per-message token consumption, so there is no need to adjust the context window for them. The `OpenAI` class in the LlamaIndex framework is where the `llm_metadata` used above comes from. I hope this clarifies your question. If you have any further questions, feel free to ask. Best,
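To illustrate the effect of that 150-token buffer, here is a minimal sketch of how a framework might derive the tokens available for a model's reply. The function name and structure are hypothetical (not LlamaIndex's actual API); only the 150-token chat adjustment mirrors the snippet above.

```python
# Hypothetical helper mirroring the chat-model adjustment shown above.
# CHAT_BUFFER reserves room for per-message role/formatting overhead
# (system/user/assistant framing) that chat requests incur.
CHAT_BUFFER = 150

def available_max_tokens(context_window: int, prompt_tokens: int, is_chat: bool) -> int:
    """Tokens left for the model's reply after the prompt, applying the
    chat-model buffer when appropriate. Never returns a negative value."""
    effective_window = context_window - CHAT_BUFFER if is_chat else context_window
    return max(0, effective_window - prompt_tokens)

# gpt-3.5-turbo-style 4096-token window with a 1000-token prompt:
print(available_max_tokens(4096, 1000, is_chat=True))   # 2946
print(available_max_tokens(4096, 1000, is_chat=False))  # 3096
```

The chat request ends up with 150 fewer tokens of headroom than the equivalent completion request, which is exactly the asymmetry the question is about.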
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
-
To clarify the specific situation in which we encountered this problem: we were using TreeSummarize on documents with GPT 3.5 Turbo, and midway through summarizing a document we would hit this issue of too many tokens being used for a request.
-
Question Validation
Question
I recently ran into an issue where our GPT 3.5 Turbo requests were producing too many tokens, which put us over the limit and caused an error. After looking into solutions, I noticed that completion requests automatically adjust the max token limit while chat requests don't.
I was curious whether I was missing some alternative way of handling this case, or whether there was a specific reason why the functionality existed for one request type but not the other.