
[Performance]: Why is HF better than vLLM when using benchmark_throughput? #4702

Open
yuki252111 opened this issue May 9, 2024 · 4 comments
Labels
performance Performance-related issues

Comments

@yuki252111 commented May 9, 2024

When I run the benchmark on an H800, the results are confusing. Why is HF better than vLLM? Is anything wrong with how I run the script?

python benchmark_throughput.py --input-len 128 --model /home/jiekong/.cache/modelscope/hub/AI-ModelScope/opt-125 --output-len 128 --max-num-batched-tokens 2048 --trust-remote-code

Throughput: 59.50 requests/s, 15231.62 tokens/s


python benchmark_throughput.py --input-len 128 --model /home/jiekong/.cache/modelscope/hub/AI-ModelScope/opt-125 --output-len 128 --backend hf --hf-max-batch-size 256

Throughput: 108.34 requests/s, 27736.31 tokens/s


@mgoin (Collaborator) commented May 9, 2024

Hi @yuki252111, when I tried this with a more realistically sized LLM, meta-llama/Meta-Llama-3-8B-Instruct, vLLM was roughly 2x faster. I used an RTX A6000 GPU.

vLLM:

python benchmark_throughput.py --input-len 128 --model meta-llama/Meta-Llama-3-8B-Instruct --output-len 128 --backend vllm

Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:59<00:00, 16.94it/s]
Throughput: 16.90 requests/s, 4326.03 tokens/s

HF:

python benchmark_throughput.py --input-len 128 --model meta-llama/Meta-Llama-3-8B-Instruct --output-len 128 --backend hf --hf-max-batch-size 128

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [01:52<00:00,  8.88it/s]
Throughput: 8.88 requests/s, 2272.90 tokens/s

@AlexBlack2202 commented

I have the same issue here. Do you have any update?

@yuki252111 (Author) commented May 10, 2024

@AlexBlack2202
I analyzed it from the source code; the following is my personal speculation (a toy sketch illustrating it follows this list):

  1. The two best-known features of vLLM are paged attention and continuous batching.
  2. With this benchmark's settings, the effect of continuous batching is very weak (because the batch size is so large):
    a. vLLM adds the requests to a queue and then schedules up to 256 sequences at a time.
    b. HF is more direct: a for loop runs a fixed batch of 256 sequences each time.
  3. The advantage of paged attention is its high utilization of GPU memory, which is best demonstrated by a model with relatively large parameters.
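Here is a minimal toy scheduler simulation (my own sketch, not vLLM's real scheduler or the benchmark script) of points 1–2: when every request decodes the same fixed number of output tokens, as this benchmark does with --output-len 128, a static 256-wide batch wastes almost no slots and continuous batching has no room to help; with variable output lengths it clearly does. Step counts are schematic, not real throughput.

```python
import random

def static_batching_steps(lengths, max_batch):
    """Each fixed batch runs until its longest sequence finishes (HF-style loop)."""
    steps = 0
    for i in range(0, len(lengths), max_batch):
        steps += max(lengths[i:i + max_batch])
    return steps

def continuous_batching_steps(lengths, max_batch):
    """A finished sequence's slot is refilled immediately from the waiting queue."""
    pending, running, steps = list(lengths), [], 0
    while pending or running:
        while pending and len(running) < max_batch:
            running.append(pending.pop())
        steps += 1                                   # one decode step for the whole batch
        running = [r - 1 for r in running if r > 1]  # drop sequences that just finished
    return steps

random.seed(0)
fixed = [128] * 1000                                     # every request decodes exactly 128 tokens
varied = [random.randint(16, 128) for _ in range(1000)]  # closer to real traffic

for name, lengths in (("fixed output len", fixed), ("varied output len", varied)):
    s = static_batching_steps(lengths, 256)
    c = continuous_batching_steps(lengths, 256)
    print(f"{name}: static={s} steps, continuous={c} steps")
```

In the fixed-length case the two schedules take exactly the same number of steps, which supports the guess that the opt-125 run is mostly measuring per-step framework overhead rather than scheduling quality.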

So I ran two experiments.

Experiment 1: batch_size = 20

This setting makes continuous batching more effective:
HF: Throughput: 12.51 requests/s, 3202.18 tokens/s
vLLM: Throughput: 29.34 requests/s, 7510.30 tokens/s

Experiment 2: llama-7b-chat-hf, batch_size = 20

HF: Throughput: 2.77 requests/s, 708.53 tokens/s
vLLM: Throughput: 12.93 requests/s, 3310.62 tokens/s

Thanks to @mgoin.

This result seems to make sense... A rough sketch of how the vLLM side of a batch_size=20 run could be set up is below.
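For completeness, here is that sketch. I have not checked whether benchmark_throughput.py exposes a batch-size flag for the vLLM backend, so this goes through the Python API, and the Llama model path is an assumption; the HF side can simply pass --hf-max-batch-size 20.

```python
# Sketch only: cap vLLM's concurrent batch at 20 sequences via max_num_seqs,
# an engine argument forwarded by the LLM constructor.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",  # assumed HF path for "llama-7b-chat-hf"
    max_num_seqs=20,                        # schedule at most 20 sequences per step
)
sampling = SamplingParams(temperature=1.0, max_tokens=128, ignore_eos=True)

prompts = ["hello " * 128] * 1000  # dummy prompts standing in for --input-len 128
outputs = llm.generate(prompts, sampling)
print(f"generated {len(outputs)} outputs")
```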

@AlexBlack2202 commented

(quoting the analysis above)

Thank you very much.
