
[Performance]: large rate of decrease in generation throughput when SamplingParams.logprobs increases #4699

Open
jeffrey-fong opened this issue May 9, 2024 · 1 comment
Labels
performance Performance-related issues

Comments

@jeffrey-fong

Proposal to improve performance

No response

Report of performance regression

Model: meta-llama/Meta-Llama-3-8B-Instruct
GPU: 1x A6000

| `SamplingParams.logprobs` | Generation throughput, vLLM==0.4.2 (tokens/s) | Generation throughput, vLLM==0.3.0 (tokens/s) |
|---|---|---|
| 100 | 37.76435 | 39.47888 |
| 1000 | 27.62177 | 38.71567 |
| 10000 | 9.62248 | 36.21264 |
| 20000 | 5.36203 | 35.83716 |
| `len(tokenizer.vocab.keys())` | 1.00312 | 22.60807 |
| `len(tokenizer.vocab.keys())` + fix post-processing | 6.03333 | N.A. |

Misc discussion on performance

In my project, I have a vLLM server, and during inference I need to get all the log probs for every generated token because I do further post-processing/sampling on them. Previously, getting all the log probs did not cause a visible drop in generation throughput, but the drop became especially obvious after I migrated to vLLM==0.4.1.

In vLLM==0.4.1 and 0.4.2, the generation throughput drops sharply as SamplingParams.logprobs increases. It can get as low as 1 token/s with meta-llama/Meta-Llama-3-8B-Instruct on an A6000 when engine_args.max_logprobs == sampling_params.logprobs == len(tokenizer.vocab.keys()), which is 128256 for Llama-3. Back in v0.3.0 (the version previously pinned in my project), the drop was much smaller.
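
For context, the setup is roughly the following (a minimal sketch, not the exact code from my project: the prompt, max_tokens, and the post-processing step are placeholders, and the Logprob attributes are as I understand them in 0.4.x):

```python
import time
from vllm import LLM, SamplingParams

VOCAB_SIZE = 128256  # len(tokenizer.vocab.keys()) for Llama-3

# max_logprobs raises the engine-level cap so the request below is accepted.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", max_logprobs=VOCAB_SIZE)

# Request the log prob of every vocab token for each generated token.
params = SamplingParams(max_tokens=256, logprobs=VOCAB_SIZE)

start = time.perf_counter()
outputs = llm.generate(["Write a short story about a robot."], params)
elapsed = time.perf_counter() - start

completion = outputs[0].outputs[0]
print(f"generation throughput: {len(completion.token_ids) / elapsed:.2f} tokens/s")

# completion.logprobs holds one dict per generated token, mapping
# token_id -> Logprob (with .logprob and, when detokenized, .decoded_token).
for token_id, per_token in zip(completion.token_ids, completion.logprobs):
    chosen_logprob = per_token[token_id].logprob
    # ... custom post-processing / re-sampling over the full distribution ...
```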

After investigating, there seem to be two main bottlenecks in the inference path:

  • Oddly slow detokenization in the post-processing stage in this method. It took 0.7s per token on average when sampling_params.logprobs == len(tokenizer.vocab.keys()). In my experimentation, I limited detokenization to the top-20 tokens by logprob, similar to OpenAI's Chat Completion request where top_logprobs has an upper bound of 20, and this raised the throughput to 6.03333 tokens/s (a sketch of the idea follows the timing log below). Do you think we should implement this top-n-only detokenization logic, given that OpenAI also limits top_logprobs to 20? I can raise a PR for this if that is what you would like.
  • Model execution becomes slow and unstable when sampling_params.logprobs is large. I measured the time taken to complete this function, and it is unstable, with some tokens occasionally taking 0.3-0.5s instead of the "normal" ~0.11s. I'm not sure what the reason for this is. Has anyone encountered this before?
```
self.model_executor.execute_model_async: 0.2736055850982666
self.model_executor.execute_model_async: 0.2578315734863281
self.model_executor.execute_model_async: 0.2712900638580322
self.model_executor.execute_model_async: 0.11722660064697266
self.model_executor.execute_model_async: 0.28998589515686035
self.model_executor.execute_model_async: 0.11808204650878906
self.model_executor.execute_model_async: 0.30967116355895996
self.model_executor.execute_model_async: 0.11808013916015625
self.model_executor.execute_model_async: 0.34619808197021484
self.model_executor.execute_model_async: 0.11945867538452148
self.model_executor.execute_model_async: 0.11967015266418457
self.model_executor.execute_model_async: 0.4465758800506592
self.model_executor.execute_model_async: 0.12009191513061523
self.model_executor.execute_model_async: 0.11814165115356445
self.model_executor.execute_model_async: 0.11848926544189453
self.model_executor.execute_model_async: 0.43151330947875977
self.model_executor.execute_model_async: 0.12377119064331055
self.model_executor.execute_model_async: 0.11845278739929199
self.model_executor.execute_model_async: 0.11834073066711426
self.model_executor.execute_model_async: 0.5058283805847168
self.model_executor.execute_model_async: 0.11896276473999023
self.model_executor.execute_model_async: 0.11865472793579102
```
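
To illustrate the top-n-only idea, here is a rough user-level sketch (not the actual vLLM detokenizer code; the tokenizer call and the cutoff of 20 are assumptions):

```python
# Sketch of top-n-only detokenization: given one generated token's full
# logprob dict, decode only the top-N token ids instead of all 128256.
from transformers import AutoTokenizer

TOP_N = 20  # mirrors OpenAI's top_logprobs upper bound

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def detokenize_top_n(token_id_to_logprob: dict[int, float], top_n: int = TOP_N) -> dict[int, str]:
    """Decode only the top-n token ids ranked by logprob."""
    top = sorted(token_id_to_logprob.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return {token_id: tokenizer.decode([token_id]) for token_id, _ in top}
```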

Your current environment (if you think it is necessary)

The output of `python collect_env.py`
@jeffrey-fong jeffrey-fong added the performance Performance-related issues label May 9, 2024
@jeffrey-fong
Author

I just realized that there is a detokenize attribute in the SamplingParams class that is currently only exposed at the engine level. This solves the first bottleneck for my use case. However, I am still trying to investigate the cause of bottleneck 2. If you know or discover the cause of the problem, do let me know!
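
For anyone hitting the same thing, this is roughly how I plan to use it (a minimal sketch, assuming you construct SamplingParams yourself, e.g. via the LLM class or AsyncLLMEngine, rather than going through the OpenAI server; the prompt and max_tokens are placeholders):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", max_logprobs=128256)

# detokenize=False skips detokenization of the output entirely (no decoded
# text, including for the returned logprobs), leaving raw token ids and
# log probabilities only.
params = SamplingParams(max_tokens=256, logprobs=128256, detokenize=False)

outputs = llm.generate(["Write a short story about a robot."], params)
per_token_logprobs = outputs[0].outputs[0].logprobs  # list of {token_id: Logprob} dicts
```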
