In my project, I run a vLLM server and, during inference, I need the log probs for every token in the vocabulary at each generation step so that I can do further post-processing/sampling. Requesting all the log probs previously caused no visible drop in generation throughput, but the drop became very noticeable after I migrated to vLLM==0.4.1.
In vLLM==0.4.1 and 0.4.2, generation throughput drops sharply as SamplingParams.logprobs increases. It can get as low as 1 token/s with meta-llama/Meta-Llama-3-8B-Instruct on an A6000 when engine_args.max_logprobs == sampling_params.logprobs == len(tokenizer.vocab.keys()), which is 128256 for Llama-3. Back in v0.3.0 (the version previously pinned in my project), the drop-off was much smaller.
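For context, here is a minimal sketch of the kind of request I am making (the prompt and max_tokens are just placeholders, and it assumes max_logprobs is passed through to the engine args as described above):

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
vocab_size = len(tokenizer.vocab.keys())  # 128256 for Llama-3

# max_logprobs must be raised from its default so the engine accepts a request
# asking for log probs over the whole vocabulary.
llm = LLM(model=model_id, max_logprobs=vocab_size)
sampling_params = SamplingParams(max_tokens=64, logprobs=vocab_size)

outputs = llm.generate(["Hello, my name is"], sampling_params)
# For each generated token, .logprobs holds a dict mapping every token id in
# the vocabulary to its log prob.
first_token_logprobs = outputs[0].outputs[0].logprobs[0]
```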
After investigating, there seem to be two main bottlenecks during inference:
1. Oddly slow detokenization in the post-processing stage in this method. It took 0.7s per token on average when sampling_params.logprobs == len(tokenizer.vocab.keys()). In my experiments, I limited detokenization to the top-20 tokens by logprob, similar to OpenAI's Chat Completion request where top_logprobs has an upper bound of 20, and this increased throughput to 6.03333 tokens/s (see the sketch after the second bottleneck below). Do you think we should implement this top-n-only detokenization logic, given that OpenAI also caps top_logprobs at 20? I can raise a PR for this if that is what you would like.
2. Model execution has become slow and unstable when sampling_params.logprobs is large. I measured the time taken to complete this function, and it fluctuates: some tokens occasionally take 0.3-0.5s instead of the "normal" 0.11s. I'm not sure what the reason is. Has anyone encountered this before?
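To make the first bottleneck concrete, this is roughly the top-n-only detokenization I experimented with (decode_top_n_only is just an illustrative helper, not vLLM's actual code path):

```python
from typing import Dict, Optional, Tuple

def decode_top_n_only(
    logprobs: Dict[int, float],  # token id -> log prob for the full vocabulary
    tokenizer,                   # any tokenizer exposing convert_ids_to_tokens()
    n: int = 20,                 # same cap as OpenAI's top_logprobs
) -> Dict[int, Tuple[float, Optional[str]]]:
    """Keep the log prob for every token id, but decode text only for the
    n most likely tokens; everything else gets None as its decoded text."""
    top_n = set(sorted(logprobs, key=logprobs.get, reverse=True)[:n])
    return {
        tid: (lp, tokenizer.convert_ids_to_tokens(tid) if tid in top_n else None)
        for tid, lp in logprobs.items()
    }
```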
I just realized that there is a detokenize attribute in the SamplingParams class that is only exposed at the engine level for now. This solves the first bottleneck for my use case. However, I am still investigating the cause of bottleneck 2. If you know the cause or discover it, please let me know!
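For anyone with the same use case, this is roughly how I am driving the engine directly to turn detokenization off (a sketch only; it assumes the 0.4.x LLMEngine interface and that max_logprobs is accepted as an engine arg):

```python
from vllm import EngineArgs, LLMEngine, SamplingParams

engine = LLMEngine.from_engine_args(
    EngineArgs(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        max_logprobs=128256,  # full Llama-3 vocabulary
    )
)

params = SamplingParams(max_tokens=64, logprobs=128256, detokenize=False)
engine.add_request(request_id="0", prompt="Hello, my name is", sampling_params=params)

while engine.has_unfinished_requests():
    for request_output in engine.step():
        if request_output.finished:
            # Log probs are still returned per token id, but no text is decoded
            # for them, which avoids the slow detokenization path.
            all_logprobs = request_output.outputs[0].logprobs
```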
Proposal to improve performance
No response
Report of performance regression
Model: meta-llama/Meta-Llama-3-8B-Instruct
GPU: 1x A6000