langchain_nvidia_trt not working #108

Open
rbgo404 opened this issue Apr 19, 2024 · 3 comments
Labels: question (Further information is requested)

Comments


rbgo404 commented Apr 19, 2024

I have gone through the notebooks but was not able to stream tokens from TritonTensorRTLLM.
Here's the issue:
[screenshot of the error]

Code used:

from langchain_nvidia_trt.llms import TritonTensorRTLLM
import time

triton_url = "localhost:8001"
pload = {
    "tokens": 300,
    "server_url": triton_url,
    "model_name": "ensemble",
    "temperature": 1.0,
    "top_k": 1,
    "top_p": 0,
    "beam_width": 1,
    "repetition_penalty": 1.0,
    "length_penalty": 1.0,
}
client = TritonTensorRTLLM(**pload)

LLAMA_PROMPT_TEMPLATE = (
 "<s>[INST] <<SYS>>"
 "{system_prompt}"
 "<</SYS>>"
 "[/INST] {context} </s><s>[INST] {question} [/INST]"
)
system_prompt = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Please ensure that your responses are positive in nature."
context=""
question='What is the fastest land animal?'
prompt = LLAMA_PROMPT_TEMPLATE.format(system_prompt=system_prompt, context=context, question=question)

start_time = time.time()
tokens_generated = 0

for val in client._stream(prompt):
    tokens_generated += 1
    print(val, end="", flush=True)

total_time = time.time() - start_time
print(f"\n--- Generated {tokens_generated} tokens in {total_time} seconds ---")
print(f"--- {tokens_generated/total_time} tokens/sec")

rbgo404 commented Apr 19, 2024

Please share the configuration on the TensorRT-LLM end. What parameter modifications are required in the model's config.pbtxt?
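
For reference, a hedged sketch based on the upstream Triton/TensorRT-LLM backend (not confirmed in this thread): token streaming generally requires the generation model to run in decoupled mode, i.e. its config.pbtxt contains model_transaction_policy { decoupled: true }. The backing model name "tensorrt_llm" below is an assumption about what the "ensemble" delegates to; the deployed config can be inspected over gRPC:

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")
# "tensorrt_llm" is assumed to be the ensemble's backing generation model.
cfg = client.get_model_config("tensorrt_llm", as_json=True)
# Expect {'decoupled': True} here if streaming mode is enabled in config.pbtxt.
print(cfg["config"].get("model_transaction_policy"))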


shubhadeepd commented Apr 22, 2024

Hey @rbgo404
You can deploy the TensorRT-based LLM model by following the steps here:
https://nvidia.github.io/GenerativeAIExamples/latest/local-gpu.html#using-local-gpus-for-a-q-a-chatbot

This notebook interacts with the model deployed behind the llm-inference-server container, which should be started up if you follow the steps above.
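
A minimal sketch of driving that deployment through the public LangChain interface rather than the private _stream method, assuming TritonTensorRTLLM follows the standard LangChain Runnable API and that the container publishes Triton's gRPC port 8001 on the host (both assumptions, not confirmed in this thread):

from langchain_nvidia_trt.llms import TritonTensorRTLLM

# server_url and model_name mirror the values used in the report above (assumptions).
llm = TritonTensorRTLLM(server_url="localhost:8001", model_name="ensemble", tokens=300)

# .stream() is the public counterpart of the private _stream() used in the report.
for chunk in llm.stream("What is the fastest land animal?"):
    print(chunk, end="", flush=True)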

Let me know if you have any questions once you go through these steps!

@shubhadeepd shubhadeepd self-assigned this Apr 22, 2024
@shubhadeepd shubhadeepd added the question Further information is requested label Apr 22, 2024
ChiBerkeley commented

Hi, I followed the instructions but still have a problem starting llm-inference-server. I'm currently using a Tesla M60 and llama-2-13b-chat.
[screenshot of the error, dated 2024-04-30]
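
A hedged diagnostic sketch (not from this thread): print the GPUs, their compute capability, and memory visible to the environment, which is the usual first thing to check when the inference container fails to start. Assumes PyTorch with CUDA support is available:

import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    # GPU name, compute capability (major.minor), and total memory in GB
    print(props.name, f"sm_{props.major}{props.minor}", f"{props.total_memory / 1e9:.1f} GB")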
