Question: 'history is longer than the max chat context' error #338

Open · Biskwit opened this issue Jan 1, 2024 · 18 comments

Biskwit commented Jan 1, 2024

Hi,
I'm trying to build a simple RAG script to load a PDF file (~8 pages, which is not very large, but maybe I'm wrong). At the first question I ask, I get the error:
The message history is longer than the max chat context length allowed, and we have run out of messages to drop.

Code:

import langroid as lr
from langroid.agent.special import DocChatAgentConfig  # assuming this is where DocChatAgentConfig comes from

llm = lr.language_models.OpenAIGPTConfig(
    api_base="http://localhost:8000/v1",
    chat_model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    completion_model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    use_chat_for_completion=False,
    max_output_tokens=4096,
    temperature=0.2,
)

config = DocChatAgentConfig(
    default_paths=[],
    max_context_tokens=8192,
    conversation_mode=True,
    llm=llm,
    relevance_extractor_config=lr.agent.special.RelevanceExtractorAgentConfig(
        llm=llm
    ),
    vecdb=lr.vector_store.QdrantDBConfig(
        collection_name="test1",
        replace_collection=True,
        embedding=hf_embed_config,  # HuggingFace embeddings config defined elsewhere
    ),
    doc_paths=[
        "./test1/2312.17238.pdf"
    ],
)

I tried changing some variables (max_context_tokens, max_output_tokens, ...) but without any effect, even with max_context_tokens at 32000.

Did I forget something, or am I doing something wrong in the doc load? Or is my PDF too large?

Thanks for your work 👍

pchalasani commented Jan 1, 2024

There is another param in the OpenAIGPTConfig: chat_context_length, which defines the total (input + output) context length, and it defaults to 1024.
When I ran this paper through your settings, it eventually asked for the model response with:

  • history length = 1471 tokens
  • max_output_tokens = 4096 (which is a lot; you would want to shorten it to, say, 500)
  • chat_context_length = 1024

so clearly history + max_output_tokens > chat_context_length, which led to the error you are seeing (see the sketch below).
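
To make the arithmetic concrete, here is a minimal sketch (plain Python, using the numbers above) of the budget constraint that has to hold before each model call; the exact check inside Langroid may differ:

history_tokens = 1471        # tokens already in the chat history
max_output_tokens = 4096     # room reserved for the model's reply
chat_context_length = 1024   # total (input + output) context given to the model

if history_tokens + max_output_tokens > chat_context_length:
    # 1471 + 4096 = 5567 > 1024, so old messages have to be dropped,
    # and once there is nothing left to drop you get the error above.
    print("history is longer than the max chat context")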

I suggest first verifying that simple chat works for your local LLM setup. I don't know the details of how you spun up the mistral model, but if you already have it spun up locally and listening at http://localhost:8000/v1, then you only need to set the config like this:

import langroid as lr
import langroid.language_models as lm
llm = lm.OpenAIGPTConfig(
    chat_model="local/localhost:8000/v1", # <-- note you must have "local/" at the beginning
    use_chat_for_completion=True,
    chat_context_length=4096,                   # set this based on model
    max_output_tokens=100,
    temperature=0.2
)

Before trying RAG, I suggest first verifying simple chat works with these settings (or change them if your local LLM serving setup differs from my assumption):

agent = lr.ChatAgent(lr.ChatAgentConfig(llm=llm))
agent.llm_response("What is 3+4?")

Once this works, then try the document-chat with this config --

config = DocChatAgentConfig(
    default_paths=[],
    conversation_mode=True,
    llm=llm,
    relevance_extractor_config=lr.agent.special.RelevanceExtractorAgentConfig(
        llm=llm
    ),
    vecdb=lr.vector_store.QdrantDBConfig(
        collection_name="test1",
        replace_collection=True,
        embedding=hf_embed_config,
    ),
    doc_paths=[
        "https://arxiv.org/pdf/2312.17238.pdf",
    ],
)

(Note that max_context_tokens is a vestigial param that is not used and will be removed in an upcoming release.)
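
For completeness, a minimal usage sketch for that config (assuming DocChatAgent is importable from langroid.agent.special, alongside the configs above; the question is just an example):

# Sketch: build the DocChat agent from the config above and ask one question.
agent = lr.agent.special.DocChatAgent(config)
response = agent.llm_response("What is the main contribution of this paper?")
print(response.content if response is not None else "no answer found")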

pchalasani commented Jan 2, 2024

OK, I've set up a working example using ollama to run your RAG example with mistral:7b-instruct-v0.2-q4_K_M. See this example script in the langroid-examples repo:
https://github.com/langroid/langroid-examples/blob/main/examples/docqa/rag-local-simple.py
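
The only model-specific piece is the LLM config; roughly, it boils down to something like this (a sketch using the ollama model name above; see the script for the exact code):

import langroid as lr
import langroid.language_models as lm

# The "ollama/" prefix tells Langroid to talk to a locally running ollama server.
llm = lm.OpenAIGPTConfig(
    chat_model="ollama/mistral:7b-instruct-v0.2-q4_K_M",
    chat_context_length=32_000,  # set this based on the model
    max_output_tokens=100,
    temperature=0.2,
)
agent = lr.ChatAgent(lr.ChatAgentConfig(llm=llm))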

This works for me on an M1 Mac. Here's a sample run:

[screenshot of sample run]

Biskwit commented Jan 2, 2024

There is another param in the OpenAIGPTConfig: chat_context_length, which defines the total (input + output) context length, and it defaults to 1024. [...]

Thank you very much for your help. I didn't see the chat_context_length param, my bad; it's all good now.

PS: for the chat_model, I can't use local/http://localhost:8000/v1. I'm using vllm to run my model and I get some errors (as I remember I had a 404 because it tried to find a model called local/http://localhost:8000/v1, but when I retested I just got a connection error), but I think it's related to vllm, which may not be compatible with the Langroid request logic.

pchalasani commented:

use the local/http://localhost:8000/v1

Note that the syntax is “local/localhost:8000/v1”, i.e. you shouldn’t include the “http://”.
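
A minimal illustration (the host/port is just the one from your setup):

chat_model="local/localhost:8000/v1"          # correct
chat_model="local/http://localhost:8000/v1"   # wrong: drop the "http://"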

Biskwit commented Jan 2, 2024

Yep, this is the thing that generates a 404:

WARNING - OpenAI API request failed with error:
Error code: 404 - {'object': 'error', 'message': 'The model local/localhost:8000/v1 does not exist.', 'type': 'invalid_request_error', 'param': None, 'code': None}.

pchalasani commented Jan 2, 2024

That's puzzling. Looking at the vllm docs, it should launch an OpenAI-compatible endpoint at http://localhost:8000/v1, and Langroid should then work with OpenAIGPTConfig(chat_model="local/localhost:8000/v1"). The only thing Langroid expects is an OpenAI-compatible endpoint.

The error message The model ... does not exist that you showed indicates that it is trying to use an OpenAI model, hence complaining that the model is invalid. This shouldn't be happening if your OpenAIGPTConfig.chat_model is set as local/... or litellm/...; in those cases Langroid automatically enables "local model mode".

Biskwit commented Jan 2, 2024

That's puzzling. Looking at the vllm docs, it should launch an OpenAI-compatible endpoint at http://localhost:8000/v1 [...]

I think it's because this check exists in vllm, and it's not present in LiteLLM or Ollama (?): https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/api_server.py#L131-L138.

pchalasani commented Jan 2, 2024

Can you post your exact OpenAIGPTConfig setting here? I haven't tested with vllm, but this may be helpful: https://docs.litellm.ai/docs/providers/vllm

You might try setting chat_model="litellm/vllm/[model-name]", though if your model is actually listening at localhost:8000/v1 then this shouldn't be needed. I don't know your actual config, so it's hard to say.
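
Roughly, that would look like this (a sketch, untested with vllm on my side; the model name is just the one you mentioned, and note that going through LiteLLM's vllm provider may spin up its own copy of the model rather than reuse your running server):

import langroid.language_models as lm

llm = lm.OpenAIGPTConfig(
    chat_model="litellm/vllm/TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    chat_context_length=4096,  # set this based on the model
    max_output_tokens=256,
    temperature=0.2,
)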

Biskwit commented Jan 2, 2024

When I try chat_model="litellm/vllm/[model-name]", it downloads the model and runs it.

This is my working config at the moment:

llm = lr.language_models.OpenAIGPTConfig(
    api_base="http://localhost:8000/v1",
    chat_model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    completion_model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    use_chat_for_completion=True,
    max_output_tokens=256,
    temperature=0.2,
    chat_context_length=4096,
)

pchalasani commented:

it downloads the model and runs it

Did you mean to say it's working now?

Biskwit commented Jan 4, 2024

it downloads the model and runs it

Did you mean to say it's working now?

I mean it downloads and runs the LLM, but that's not what I expect for my use case; I already have a running LLM, so I just want to plug Langroid into it. That is why I use the config I posted above.

Shaktizala commented Feb 15, 2024

OK, I've set up a working example using ollama to run your RAG example with mistral:7b-instruct-v0.2-q4_K_M [...] This works for me on an M1 Mac.

I followed the same steps, but the problem is that it works fine with simple examples/chat; when it comes to using DocChat, it doesn't give any response.

It just shows the logs, like retrieving objects and so on, but in the end it doesn't give any output/response.

@pchalasani Can you please help me with this? I'm running it on my Arch Linux machine with the zsh terminal.

PS: I am using it to create a RAG application; I have access to Mistral-7B via the oobabooga text-generation-webui and am using the same in Langroid. Also, please guide me if I'm missing something, since I'm new to these things.

pchalasani commented:

but when it comes to using DocChat it doesn't give any response.

Are you running exactly the rag-local-simple.py script? Let me know the exact script syntax you are using, and any changes you made to rag-local-simple.py.

The correct syntax when using ollama is:

python3 examples/docqa/rag-local-simple.py -m ollama/mistral:7b-instruct-v0.2-q8_0

or, if you use ooba to serve this model at 127.0.0.1:5000/v1:

python3 examples/docqa/rag-local-simple.py -m local/127.0.0.1:5000/v1

pchalasani commented:

If you ran it correctly, the fact that you're not getting a response could mean that for that specific question no answer was found. If you're not getting a response to any question at all (especially "obvious" ones it should be able to answer), then that needs looking into.

Shaktizala commented Feb 15, 2024

Are you running exactly the rag-local-simple.py script? Let me know the exact script syntax you are using, and any changes you made to rag-local-simple.py. [...]

llm_config = lm.OpenAIGPTConfig(
    # if you comment out `chat_model`, it will default to OpenAI GPT4-turbo
    # chat_model="ollama/mistral:7b-instruct-v0.2-q4_K_M",
    chat_model="local/10.0.**.***:5000/v1",
    chat_context_length=32_000,  # set this based on model
    max_output_tokens=100,
    temperature=0.2,
    stream=True,
    timeout=45,
)

I made these changes.

pchalasani commented:

I made these changes.

That should work, for a good-enough local model. (I assume you don't actually use **.*** in the URL and are just masking it here.)

Shaktizala commented Feb 16, 2024

That should work, for a good-enough local model. [...]

Yes, but it's still not giving any response, nor any errors.

Do I have to provide an OpenAI key? I am using Mistral-7B, which is locally deployed.

pchalasani commented:

Can you check with other documents and/or a wider variety of questions and see if you're still not getting a response? You can also try the bare chat mode (i.e. not document Q/A), as suggested in the comments in the script, to ensure that your local LLM setup is working.

And finally, if you can make a reproducible example (or give me a specific doc and a specific question), I can see if I can reproduce this issue.
