
[Performance]: large rate of decrease in generation throughput when SamplingParams.logprobs increases #4699

Open
jeffrey-fong opened this issue May 9, 2024 · 1 comment
Labels
performance Performance-related issues

Comments

@jeffrey-fong

Proposal to improve performance

No response

Report of performance regression

Model: meta-llama/Meta-Llama-3-8B-Instruct
GPU: 1x A6000

| `SamplingParams.logprobs` | Generation throughput, vLLM==0.4.2 (tokens/s) | Generation throughput, vLLM==0.3.0 (tokens/s) |
|---|---|---|
| 100 | 37.76435 | 39.47888 |
| 1000 | 27.62177 | 38.71567 |
| 10000 | 9.62248 | 36.21264 |
| 20000 | 5.36203 | 35.83716 |
| `len(tokenizer.vocab.keys())` | 1.00312 | 22.60807 |
| `len(tokenizer.vocab.keys())` + fix post-processing | 6.03333 | N.A. |

Misc discussion on performance

In my project, I have a vLLM server, and during inference I need to get all the log probs for every generated token because I do further post-processing/sampling on them. Previously, getting all the log probs did not cause a visible drop in generation throughput, but the drop became especially obvious after I migrated to vLLM==0.4.1.

In vLLM==0.4.1 and 0.4.2, the generation throughput drops sharply as SamplingParams.logprobs increases. It can get as low as 1 token/s with meta-llama/Meta-Llama-3-8B-Instruct on an A6000 when engine_args.max_logprobs == sampling_params.logprobs == len(tokenizer.vocab.keys()), which is 128256 for Llama-3. Back in v0.3.0 (the version previously pinned in my project), the drop was much smaller.
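
For context, the setup is roughly the following (a minimal sketch, not the exact code from my project: the prompt, max_tokens, and the post-processing step are placeholders, and the Logprob attributes are as I understand them in 0.4.x):

```python
import time
from vllm import LLM, SamplingParams

VOCAB_SIZE = 128256  # len(tokenizer.vocab.keys()) for Llama-3

# max_logprobs raises the engine-level cap so the request below is accepted.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", max_logprobs=VOCAB_SIZE)

# Request the log prob of every vocab token for each generated token.
params = SamplingParams(max_tokens=256, logprobs=VOCAB_SIZE)

start = time.perf_counter()
outputs = llm.generate(["Write a short story about a robot."], params)
elapsed = time.perf_counter() - start

completion = outputs[0].outputs[0]
print(f"generation throughput: {len(completion.token_ids) / elapsed:.2f} tokens/s")

# completion.logprobs holds one dict per generated token, mapping
# token_id -> Logprob (with .logprob and, when detokenized, .decoded_token).
for token_id, per_token in zip(completion.token_ids, completion.logprobs):
    chosen_logprob = per_token[token_id].logprob
    # ... custom post-processing / re-sampling over the full distribution ...
```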

After investigating, there seem to be two main bottlenecks in the inference path:

  • Oddly slow detokenization in the post-processing stage in this method. It took 0.7s per token on average when sampling_params.logprobs == len(tokenizer.vocab.keys()). In my experimentation, I limited detokenization to the top-20 tokens by logprob, similar to OpenAI's Chat Completion request where top_logprobs has an upper bound of 20, and this raised the throughput to 6.03333 tokens/s (a sketch of the idea follows the timing log below). Do you think we should implement this top-n-only detokenization logic, given that OpenAI also limits top_logprobs to 20? I can raise a PR for this if that is what you would like.
  • Model execution becomes slow and unstable when sampling_params.logprobs is large. I measured the time taken to complete this function, and it is unstable, with some tokens occasionally taking 0.3-0.5s instead of the "normal" ~0.11s. I'm not sure what the reason for this is. Has anyone encountered this before?
```
self.model_executor.execute_model_async: 0.2736055850982666
self.model_executor.execute_model_async: 0.2578315734863281
self.model_executor.execute_model_async: 0.2712900638580322
self.model_executor.execute_model_async: 0.11722660064697266
self.model_executor.execute_model_async: 0.28998589515686035
self.model_executor.execute_model_async: 0.11808204650878906
self.model_executor.execute_model_async: 0.30967116355895996
self.model_executor.execute_model_async: 0.11808013916015625
self.model_executor.execute_model_async: 0.34619808197021484
self.model_executor.execute_model_async: 0.11945867538452148
self.model_executor.execute_model_async: 0.11967015266418457
self.model_executor.execute_model_async: 0.4465758800506592
self.model_executor.execute_model_async: 0.12009191513061523
self.model_executor.execute_model_async: 0.11814165115356445
self.model_executor.execute_model_async: 0.11848926544189453
self.model_executor.execute_model_async: 0.43151330947875977
self.model_executor.execute_model_async: 0.12377119064331055
self.model_executor.execute_model_async: 0.11845278739929199
self.model_executor.execute_model_async: 0.11834073066711426
self.model_executor.execute_model_async: 0.5058283805847168
self.model_executor.execute_model_async: 0.11896276473999023
self.model_executor.execute_model_async: 0.11865472793579102
```
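
To illustrate the top-n-only idea, here is a rough user-level sketch (not the actual vLLM detokenizer code; the tokenizer call and the cutoff of 20 are assumptions):

```python
# Sketch of top-n-only detokenization: given one generated token's full
# logprob dict, decode only the top-N token ids instead of all 128256.
from transformers import AutoTokenizer

TOP_N = 20  # mirrors OpenAI's top_logprobs upper bound

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def detokenize_top_n(token_id_to_logprob: dict[int, float], top_n: int = TOP_N) -> dict[int, str]:
    """Decode only the top-n token ids ranked by logprob."""
    top = sorted(token_id_to_logprob.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return {token_id: tokenizer.decode([token_id]) for token_id, _ in top}
```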

Your current environment (if you think it is necessary)

The output of `python collect_env.py`
@jeffrey-fong jeffrey-fong added the performance Performance-related issues label May 9, 2024
@jeffrey-fong
Author

I just realized that there is a detokenize attribute in the SamplingParams class that is currently only exposed at the engine level. This solves the first bottleneck for my use case. However, I am still trying to investigate the cause of bottleneck 2. If you know or discover the cause of the problem, do let me know!
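
For anyone hitting the same thing, this is roughly how I plan to use it (a minimal sketch, assuming you construct SamplingParams yourself, e.g. via the LLM class or AsyncLLMEngine, rather than going through the OpenAI server; the prompt and max_tokens are placeholders):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", max_logprobs=128256)

# detokenize=False skips detokenization of the output entirely (no decoded
# text, including for the returned logprobs), leaving raw token ids and
# log probabilities only.
params = SamplingParams(max_tokens=256, logprobs=128256, detokenize=False)

outputs = llm.generate(["Write a short story about a robot."], params)
per_token_logprobs = outputs[0].outputs[0].logprobs  # list of {token_id: Logprob} dicts
```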
