
[Bug] Set logprobs = true and top_logprobs = 5 in the RESTful server; the number of top logprobs returned is 4, which is unexpected. #1548

zhulinJulia24 opened this issue May 6, 2024 · 3 comments

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.

Describe the bug

After setting logprobs = true and top_logprobs = 5, the number of top logprobs in the response is not correct: only 4 are returned for each token. I expect it to be 5 for each token.

Reproduction

  1. Start an api_server with a model such as internlm2-chat-20b, e.g. lmdeploy serve api_server /nvme/qa_test_models/internlm/internlm2-chat-20b --tp 2
  2. Open Swagger and send a request to /v1/chat/completions like the following JSON (a programmatic equivalent is sketched after it):
{
  "model": "internlm2",
  "messages": [
    {
      "content": "Shanghai is",
      "role": "user"
    }
  ],
  "logprobs": true,
  "top_logprobs": 5,
  "max_tokens": 20
}
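
For reference, the same request can also be sent without Swagger. A minimal sketch using the requests package, assuming the api_server from step 1 is listening on localhost:23333 (the port used by the client script further below):

import requests

# Same payload as the JSON above.
payload = {
    "model": "internlm2",
    "messages": [{"content": "Shanghai is", "role": "user"}],
    "logprobs": True,
    "top_logprobs": 5,
    "max_tokens": 20,
}
resp = requests.post("http://localhost:23333/v1/chat/completions", json=payload)
resp.raise_for_status()
for choice in resp.json()["choices"]:
    # Inspect how many top_logprobs entries come back per generated token.
    print(choice["logprobs"])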

[Screenshot: Swagger request body with logprobs and top_logprobs set]

Checking the response, only 4 top logprobs are returned per token; I expect 5.
[Screenshot: response showing only 4 top_logprobs entries per token]

  3. Also, I cannot get logprobs returned when calling the /v1/chat/completions API through the Python client. The script is:
from lmdeploy.serve.openai.api_client import APIClient

api_client = APIClient('http://localhost:23333')
for output in api_client.chat_completions_v1(model='internlm2',
                                             messages='Shanghai is',
                                             logprobs=True,
                                             top_logprobs=5,
                                             max_tokens=20):
    continue
print(output)

The logprobs field in the response is None, which is unexpected:

{'id': '1', 'object': 'chat.completion', 'created': 1714991049, 'model': 'internlm2', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': ' a city of contrasts. The city is both ancient and modern, traditional and innovative, and a world'}, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 4, 'total_tokens': 25, 'completion_tokens': 21}}
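
For reference, with the OpenAI-style chat completions schema that this endpoint mirrors, a populated logprobs field would look roughly like the sketch below (an illustrative shape only, not lmdeploy's verbatim output):

# Illustrative shape of a populated logprobs field (assumed OpenAI-style schema).
expected_logprobs_shape = {
    "content": [
        {
            "token": " a",
            "logprob": -0.12,
            "top_logprobs": [
                {"token": " a", "logprob": -0.12},
                {"token": " the", "logprob": -1.85},
                # ... up to `top_logprobs` alternatives per generated token
            ],
        },
        # one entry per generated token
    ]
}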

Environment

sys.platform: linux
Python: 3.10.14 (main, Mar 21 2024, 16:24:04) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA A100-SXM4-80GB
CUDA_HOME: /usr/local/cuda-11.7
NVCC: Cuda compilation tools, release 11.7, V11.7.64
GCC: gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
PyTorch: 2.1.0+cu118
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 11.8
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.7
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

TorchVision: 0.16.0+cu118
LMDeploy: 0.4.0+
transformers: 4.40.1
gradio: 4.28.0
fastapi: 0.110.2
pydantic: 2.7.0
triton: 2.1.0

Error traceback

No response

irexyc (Collaborator) commented May 23, 2024

The top-k / top-p operations reduce the number of candidate tokens in the vocabulary before sampling. When the candidates are reduced to 4, I don't think it is necessary to force probabilities for 5 tokens to be output (vLLM does output them, but with probability -inf). If the sampled token is not among the top 5, the returned top logprobs may contain 6 entries.
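
To make the mechanism concrete, here is a minimal, illustrative sketch (not lmdeploy's actual implementation) of how nucleus (top-p) filtering can shrink the candidate set below the requested top_logprobs:

import torch

def top_p_filter(logits: torch.Tensor, top_p: float = 0.8) -> torch.Tensor:
    # Keep the smallest set of tokens whose cumulative probability reaches top_p;
    # everything else is masked to -inf and can no longer be reported.
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < top_p
    filtered = torch.full_like(logits, float('-inf'))
    filtered[sorted_idx[keep]] = logits[sorted_idx[keep]]
    return filtered

# A peaked distribution: after filtering only a few candidates survive,
# so fewer than 5 top logprobs can be reported.
logits = torch.tensor([8.0, 7.5, 7.0, 6.5, 1.0, 0.5, 0.0])
filtered = top_p_filter(logits, top_p=0.8)
print(int(torch.isfinite(filtered).sum()))  # 3 surviving candidates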

zhulinJulia24 (Collaborator, Author) commented May 27, 2024

> The top-k / top-p operations reduce the number of candidate tokens in the vocabulary before sampling. When the candidates are reduced to 4, I don't think it is necessary to force probabilities for 5 tokens to be output (vLLM does output them, but with probability -inf). If the sampled token is not among the top 5, the returned top logprobs may contain 6 entries.

Yes, but in this case I have tried top > 5 and it still returns >= 5 entries, so my understanding is that more than 5 candidate tokens remain after the top-k / top-p operations; in that case, shouldn't top = 5 also return >= 5 entries?

irexyc (Collaborator) commented May 27, 2024

At least one entry is returned, and at most top + 1 entries (when the sampled token is not among the top tokens).
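
In other words, the rule described above could look like the following sketch (an assumption for illustration, not lmdeploy's actual code), where the sampled token is always reported alongside the surviving top candidates:

import torch

def collect_top_logprobs(filtered_logits: torch.Tensor, sampled_id: int, top: int) -> dict:
    # Candidates that survived top-k / top-p filtering have finite log-probs.
    logprobs = torch.log_softmax(filtered_logits, dim=-1)
    candidate_ids = torch.isfinite(logprobs).nonzero(as_tuple=True)[0]
    k = min(top, candidate_ids.numel())      # may be smaller than `top`
    top_ids = logprobs.topk(k).indices.tolist()
    if sampled_id not in top_ids:            # sampled token is always reported
        top_ids.append(sampled_id)           # -> at most top + 1 entries
    return {i: logprobs[i].item() for i in top_ids}

# Example: only 3 candidates survive filtering, so top=5 yields just 3 entries.
logits = torch.tensor([2.0, 1.0, float('-inf'), float('-inf'), 0.5])
print(collect_top_logprobs(logits, sampled_id=0, top=5))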
