Inference error after the model is deployed #217

Open
sevenold opened this issue Dec 21, 2023 · 0 comments

sevenold commented Dec 21, 2023

RT: 0.0.5

model config

{
  "parameters": {
    "type": "dataelem.pymodel.vllm_model",
    "decoupled": "1",
    "pymodel_type": "llm.vLLMQwen7bChat",
    "pymodel_params": "{\"temperature\": 0.0, \"stop\": [\"<|im_end|>\", \"<|im_start|>\",\"<|endoftext|>\"]}",
    "gpu_memory": "20",
    "instance_groups": "device=gpu;gpus=0",
    "reload": "1",
    "verbose": "0"
  }
}
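
As a side note on the config (illustration only, not part of the original report): "pymodel_params" is itself a JSON-encoded string nested inside the config, so its generation settings can be inspected with a minimal sketch like this:

import json

# Minimal sketch: the value of "pymodel_params" from the config above is a JSON
# string, so it has to be decoded a second time before the fields can be read.
pymodel_params = "{\"temperature\": 0.0, \"stop\": [\"<|im_end|>\", \"<|im_start|>\",\"<|endoftext|>\"]}"
params = json.loads(pymodel_params)
print(params["temperature"])  # 0.0
print(params["stop"])         # ['<|im_end|>', '<|im_start|>', '<|endoftext|>']

Client request: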
import requests
url = "http://xxx:9001/v2.1/models/Qwen-7B-Chat/infer"

# The LLM input format is OpenAI-compatible
inp1 = {
  "model": "Qwen-7B-Chat",
  "messages": [
    {"role": "user", "content": "hello"}
  ]
}

# The LLM output format is largely OpenAI-compatible; difference: local models return no token usage statistics
outp1 = requests.post(url=url, json=inp1)
print(outp1.content)

b'{"error":"expected a single response, got 2"}'

RT log

2023-12-21T08:47:57.269073390Z using_decoupled True
2023-12-21T08:48:00.018871598Z INFO 12-21 16:48:00 llm_engine.py:72] Initializing an LLM engine with config: model='./models/model_repository/Qwen-7B-Chat', tokenizer='./models/model_repository/Qwen-7B-Chat', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
2023-12-21T08:48:00.558397300Z WARNING 12-21 16:48:00 tokenizer.py:66] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
2023-12-21T08:48:16.925095174Z INFO 12-21 16:48:16 llm_engine.py:207] # GPU blocks: 394, # CPU blocks: 512
2023-12-21T08:48:42.067733569Z INFO 12-21 16:48:42 llm_engine.py:624] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.5%, CPU KV cache usage: 0.0%

2023-12-21T03:56:58Z I 1 metrics.cc:870] Collecting metrics for GPU 0: NVIDIA GeForce RTX 4090
2023-12-21T03:56:58Z I 1 metrics.cc:761] Collecting CPU metrics
2023-12-21T03:56:58Z I 1 grpc_server.cc:4822] Started GRPCInferenceService at 0.0.0.0:9000
2023-12-21T03:56:58Z I 1 http_server.cc:4446] Started HTTPService at 0.0.0.0:9001
2023-12-21T03:56:58Z I 1 http_server.cc:190] Started Metrics Service at 0.0.0.0:9002
2023-12-21T04:00:18Z I 1 model_lifecycle.cc:459] loading: Qwen-7B-Chat:1
2023-12-21T04:00:21Z I 1 python_be.cc:1767] TRITONBACKEND_ModelInstanceInitialize: Qwen-7B-Chat_0 (GPU device 0)
2023-12-21T04:00:52Z I 1 model_lifecycle.cc:693] successfully loaded 'Qwen-7B-Chat' version 1
2023-12-21T04:15:15Z E 1 http_server.cc:3554] [INTERNAL] received a response without FINAL flag
2023-12-21T04:15:29Z E 1 http_server.cc:3554] [INTERNAL] received a response without FINAL flag
