Question: 'history is longer than the max chat context' error #338

Open · Biskwit opened this issue Jan 1, 2024 · 18 comments

Biskwit commented Jan 1, 2024

Hi,
I'm trying to build a simple RAG script to load a PDF file (~8 pages, which is not very large, but maybe I'm wrong). At the first question I ask, I get the error:
The message history is longer than the max chat context length allowed, and we have run out of messages to drop.

Code:

import langroid as lr
from langroid.agent.special import DocChatAgentConfig  # assuming this is where DocChatAgentConfig comes from

llm = lr.language_models.OpenAIGPTConfig(
    api_base="http://localhost:8000/v1",
    chat_model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    completion_model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    use_chat_for_completion=False,
    max_output_tokens=4096,
    temperature=0.2,
)

config = DocChatAgentConfig(
    default_paths=[],
    max_context_tokens=8192,
    conversation_mode=True,
    llm=llm,
    relevance_extractor_config=lr.agent.special.RelevanceExtractorAgentConfig(
        llm=llm
    ),
    vecdb=lr.vector_store.QdrantDBConfig(
        collection_name="test1",
        replace_collection=True,
        embedding=hf_embed_config,  # HuggingFace embeddings config defined elsewhere
    ),
    doc_paths=[
        "./test1/2312.17238.pdf"
    ],
)

I tried changing some variables (max_context_tokens, max_output_tokens, ...) but without any effect, even with max_context_tokens at 32000.

Did I forget something, or am I doing something wrong in the doc load? Or is my PDF too large?

Thanks for your work 👍

pchalasani commented Jan 1, 2024

There is another param in the OpenAIGPTConfig: chat_context_length, which defines the total (input + output) context length, and it defaults to 1024.
When I ran this paper through your settings, it eventually asked for the model response with:

  • history length = 1471 tokens
  • max_output_tokens = 4096 (which is a lot; you would want to shorten it to, say, 500)
  • chat_context_length = 1024

so clearly history + max_output_tokens > chat_context_length, which led to the error you are seeing (see the sketch below).
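
To make the arithmetic concrete, here is a minimal sketch (plain Python, using the numbers above) of the budget constraint that has to hold before each model call; the exact check inside Langroid may differ:

history_tokens = 1471        # tokens already in the chat history
max_output_tokens = 4096     # room reserved for the model's reply
chat_context_length = 1024   # total (input + output) context given to the model

if history_tokens + max_output_tokens > chat_context_length:
    # 1471 + 4096 = 5567 > 1024, so old messages have to be dropped,
    # and once there is nothing left to drop you get the error above.
    print("history is longer than the max chat context")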

I suggest first verifying that simple chat works for your local LLM setup. I don't know the details of how you spun up the mistral model, but if you already have it spun up locally and listening at http://localhost:8000/v1, then you only need to set the config like this:

import langroid as lr
import langroid.language_models as lm
llm = lm.OpenAIGPTConfig(
    chat_model="local/localhost:8000/v1", # <-- note you must have "local/" at the beginning
    use_chat_for_completion=True,
    chat_context_length=4096,                   # set this based on model
    max_output_tokens=100,
    temperature=0.2
)

Before trying RAG, I suggest first verifying simple chat works with these settings (or change them if your local LLM serving setup differs from my assumption):

agent = lr.ChatAgent(lr.ChatAgentConfig(llm=llm))
agent.llm_response("What is 3+4?")

Once this works, then try the document-chat with this config --

config = DocChatAgentConfig(
    default_paths=[],
    conversation_mode=True,
    llm=llm,
    relevance_extractor_config=lr.agent.special.RelevanceExtractorAgentConfig(
        llm=llm
    ),
    vecdb=lr.vector_store.QdrantDBConfig(
        collection_name="test1",
        replace_collection=True,
        embedding=hf_embed_config,
    ),
    doc_paths=[
        "https://arxiv.org/pdf/2312.17238.pdf",
    ],
)

(Note that max_context_tokens is a vestigial param that is not used and will be removed in an upcoming release.)
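
For completeness, a minimal usage sketch for that config (assuming DocChatAgent is importable from langroid.agent.special, alongside the configs above; the question is just an example):

# Sketch: build the DocChat agent from the config above and ask one question.
agent = lr.agent.special.DocChatAgent(config)
response = agent.llm_response("What is the main contribution of this paper?")
print(response.content if response is not None else "no answer found")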

pchalasani commented Jan 2, 2024

OK, I've set up a working example using ollama to run your RAG example with mistral:7b-instruct-v0.2-q4_K_M. See this example script in the langroid-examples repo:
https://github.com/langroid/langroid-examples/blob/main/examples/docqa/rag-local-simple.py
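
The only model-specific piece is the LLM config; roughly, it boils down to something like this (a sketch using the ollama model name above; see the script for the exact code):

import langroid as lr
import langroid.language_models as lm

# The "ollama/" prefix tells Langroid to talk to a locally running ollama server.
llm = lm.OpenAIGPTConfig(
    chat_model="ollama/mistral:7b-instruct-v0.2-q4_K_M",
    chat_context_length=32_000,  # set this based on the model
    max_output_tokens=100,
    temperature=0.2,
)
agent = lr.ChatAgent(lr.ChatAgentConfig(llm=llm))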

This works for me on an M1 Mac. Here's a sample run:

[screenshot of sample run]

Biskwit commented Jan 2, 2024

There is another param in the OpenAIGPTConfig: chat_context_length, which defines the total (input + output) context length, and it defaults to 1024. [...]

Thank you very much for your help. I didn't see the chat_context_length param, my bad; it's all good now.

PS: for the chat_model, I can't use local/http://localhost:8000/v1. I'm using vllm to run my model and I get some errors (as I remember I had a 404 because it tried to find a model called local/http://localhost:8000/v1, but when I retested I just got a connection error), but I think it's related to vllm, which may not be compatible with the Langroid request logic.

pchalasani commented:

use the local/http://localhost:8000/v1

Note that the syntax is “local/localhost:8000/v1”, i.e. you shouldn’t include the “http://”.
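
A minimal illustration (the host/port is just the one from your setup):

chat_model="local/localhost:8000/v1"          # correct
chat_model="local/http://localhost:8000/v1"   # wrong: drop the "http://"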

Biskwit commented Jan 2, 2024

Yep, this is the thing that generates a 404:

WARNING - OpenAI API request failed with error:
Error code: 404 - {'object': 'error', 'message': 'The model local/localhost:8000/v1 does not exist.', 'type': 'invalid_request_error', 'param': None, 'code': None}.

pchalasani commented Jan 2, 2024

That's puzzling. Looking at the vllm docs, it should launch an OpenAI-compatible endpoint at http://localhost:8000/v1, and Langroid should then work with OpenAIGPTConfig(chat_model="local/localhost:8000/v1"). The only thing Langroid expects is an OpenAI-compatible endpoint.

The error message The model ... does not exist that you showed indicates that it is trying to use an OpenAI model, hence complaining that the model is invalid. This shouldn't be happening if your OpenAIGPTConfig.chat_model is set as local/... or litellm/...; in those cases Langroid automatically enables "local model mode".

Biskwit commented Jan 2, 2024

That's puzzling. Looking at the vllm docs, it should launch an OpenAI-compatible endpoint at http://localhost:8000/v1 [...]

I think it's because this check exists in vllm, and it's not present in LiteLLM or Ollama (?): https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/api_server.py#L131-L138.

pchalasani commented Jan 2, 2024

Can you post your exact OpenAIGPTConfig setting here? I haven't tested with vllm, but this may be helpful: https://docs.litellm.ai/docs/providers/vllm

You might try setting chat_model="litellm/vllm/[model-name]", though if your model is actually listening at localhost:8000/v1 then this shouldn't be needed. I don't know your actual config, so it's hard to say.
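
Roughly, that would look like this (a sketch, untested with vllm on my side; the model name is just the one you mentioned, and note that going through LiteLLM's vllm provider may spin up its own copy of the model rather than reuse your running server):

import langroid.language_models as lm

llm = lm.OpenAIGPTConfig(
    chat_model="litellm/vllm/TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    chat_context_length=4096,  # set this based on the model
    max_output_tokens=256,
    temperature=0.2,
)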

Biskwit commented Jan 2, 2024

When I try chat_model="litellm/vllm/[model-name]", it downloads the model and runs it.

This is my working config at the moment:

llm = lr.language_models.OpenAIGPTConfig(
    api_base="http://localhost:8000/v1",
    chat_model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    completion_model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    use_chat_for_completion=True,
    max_output_tokens=256,
    temperature=0.2,
    chat_context_length=4096,
)

pchalasani commented:

it downloads the model and runs it

Did you mean to say it's working now?

Biskwit commented Jan 4, 2024

it downloads the model and runs it

Did you mean to say it's working now?

I mean it downloads and runs the LLM, but that's not what I expect for my use case; I already have a running LLM, so I just want to plug Langroid into it. That is why I use the config I posted above.

Shaktizala commented Feb 15, 2024

OK, I've set up a working example using ollama to run your RAG example with mistral:7b-instruct-v0.2-q4_K_M [...] This works for me on an M1 Mac.

I followed the same steps, but the problem is that it works fine with simple examples/chat; when it comes to using DocChat, it doesn't give any response.

It just shows the logs, like retrieving objects and so on, but in the end it doesn't give any output/response.

@pchalasani Can you please help me with this? I'm running it on my Arch Linux machine with the zsh terminal.

PS: I am using it to create a RAG application; I have access to Mistral-7B via the oobabooga text-generation-webui and am using the same in Langroid. Also, please guide me if I'm missing something, since I'm new to these things.

pchalasani commented:

but when it comes to using DocChat it doesn't give any response.

Are you running exactly the rag-local-simple.py script? Let me know the exact script syntax you are using, and any changes you made to rag-local-simple.py.

The correct syntax when using ollama is:

python3 examples/docqa/rag-local-simple.py -m ollama/mistral:7b-instruct-v0.2-q8_0

or, if you use ooba to serve this model at 127.0.0.1:5000/v1:

python3 examples/docqa/rag-local-simple.py -m local/127.0.0.1:5000/v1

pchalasani commented:

If you ran it correctly, the fact that you're not getting a response could mean that for that specific question no answer was found. If you're not getting a response to any question at all (especially "obvious" ones it should be able to answer), then that needs looking into.

Shaktizala commented Feb 15, 2024

Are you running exactly the rag-local-simple.py script? Let me know the exact script syntax you are using, and any changes you made to rag-local-simple.py. [...]

llm_config = lm.OpenAIGPTConfig(
    # if you comment out `chat_model`, it will default to OpenAI GPT4-turbo
    # chat_model="ollama/mistral:7b-instruct-v0.2-q4_K_M",
    chat_model="local/10.0.**.***:5000/v1",
    chat_context_length=32_000,  # set this based on model
    max_output_tokens=100,
    temperature=0.2,
    stream=True,
    timeout=45,
)

I made these changes.

pchalasani commented:

I made these changes.

That should work, for a good-enough local model. (I assume you don't actually use **.*** in the URL and are just masking it here.)

Shaktizala commented Feb 16, 2024

That should work, for a good-enough local model. [...]

Yes, but it's still not giving any response, nor any errors.

Do I have to provide an OpenAI key? I am using Mistral-7B, which is locally deployed.

pchalasani commented:

Can you check with other documents and/or a wider variety of questions and see if you're still not getting a response? You can also try the bare chat mode (i.e. not document Q/A), as suggested in the comments in the script, to ensure that your local LLM setup is working.

And finally, if you can make a reproducible example (or give me a specific doc and a specific question), I can see if I can reproduce this issue.
