
[Performance]: Why is HF better than vLLM when using benchmark_throughput? #4702

Open
yuki252111 opened this issue May 9, 2024 · 4 comments
Labels
performance Performance-related issues

Comments

@yuki252111 commented May 9, 2024

When I run the benchmark on an H800, the results are confusing. Why is HF better than vLLM? Is anything wrong with how I run the script?

python benchmark_throughput.py --input-len 128 --model /home/jiekong/.cache/modelscope/hub/AI-ModelScope/opt-125 --output-len 128 --max-num-batched-tokens 2048 --trust-remote-code

Throughput: 59.50 requests/s, 15231.62 tokens/s


python benchmark_throughput.py --input-len 128 --model /home/jiekong/.cache/modelscope/hub/AI-ModelScope/opt-125 --output-len 128 --backend hf --hf-max-batch-size 256

Throughput: 108.34 requests/s, 27736.31 tokens/s


@mgoin (Collaborator) commented May 9, 2024

Hi @yuki252111, when I tried this with a more realistically sized LLM, meta-llama/Meta-Llama-3-8B-Instruct, vLLM was roughly 2x faster. I used an RTX A6000 GPU.

vLLM:

python benchmark_throughput.py --input-len 128 --model meta-llama/Meta-Llama-3-8B-Instruct --output-len 128 --backend vllm

Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:59<00:00, 16.94it/s]
Throughput: 16.90 requests/s, 4326.03 tokens/s

HF:

python benchmark_throughput.py --input-len 128 --model meta-llama/Meta-Llama-3-8B-Instruct --output-len 128 --backend hf --hf-max-batch-size 128

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [01:52<00:00,  8.88it/s]
Throughput: 8.88 requests/s, 2272.90 tokens/s

@AlexBlack2202 commented

I have the same issue here. Do you have any update?

@yuki252111 (Author) commented May 10, 2024

@AlexBlack2202
I analyzed it from the source code; the following is my personal speculation (a toy sketch illustrating it follows this list):

  1. The two best-known features of vLLM are paged attention and continuous batching.
  2. With this benchmark's settings, the effect of continuous batching is very weak (because the batch size is so large):
    a. vLLM adds the requests to a queue and then schedules up to 256 sequences at a time.
    b. HF is more direct: a for loop runs a fixed batch of 256 sequences each time.
  3. The advantage of paged attention is its high utilization of GPU memory, which is best demonstrated by a model with relatively large parameters.
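Here is a minimal toy scheduler simulation (my own sketch, not vLLM's real scheduler or the benchmark script) of points 1–2: when every request decodes the same fixed number of output tokens, as this benchmark does with --output-len 128, a static 256-wide batch wastes almost no slots and continuous batching has no room to help; with variable output lengths it clearly does. Step counts are schematic, not real throughput.

```python
import random

def static_batching_steps(lengths, max_batch):
    """Each fixed batch runs until its longest sequence finishes (HF-style loop)."""
    steps = 0
    for i in range(0, len(lengths), max_batch):
        steps += max(lengths[i:i + max_batch])
    return steps

def continuous_batching_steps(lengths, max_batch):
    """A finished sequence's slot is refilled immediately from the waiting queue."""
    pending, running, steps = list(lengths), [], 0
    while pending or running:
        while pending and len(running) < max_batch:
            running.append(pending.pop())
        steps += 1                                   # one decode step for the whole batch
        running = [r - 1 for r in running if r > 1]  # drop sequences that just finished
    return steps

random.seed(0)
fixed = [128] * 1000                                     # every request decodes exactly 128 tokens
varied = [random.randint(16, 128) for _ in range(1000)]  # closer to real traffic

for name, lengths in (("fixed output len", fixed), ("varied output len", varied)):
    s = static_batching_steps(lengths, 256)
    c = continuous_batching_steps(lengths, 256)
    print(f"{name}: static={s} steps, continuous={c} steps")
```

In the fixed-length case the two schedules take exactly the same number of steps, which supports the guess that the opt-125 run is mostly measuring per-step framework overhead rather than scheduling quality.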

So I ran two experiments.

Experiment 1: batch_size = 20

This setting makes continuous batching more effective:
HF: Throughput: 12.51 requests/s, 3202.18 tokens/s
vLLM: Throughput: 29.34 requests/s, 7510.30 tokens/s

Experiment 2: llama-7b-chat-hf, batch_size = 20

HF: Throughput: 2.77 requests/s, 708.53 tokens/s
vLLM: Throughput: 12.93 requests/s, 3310.62 tokens/s

Thanks to @mgoin.

This result seems to make sense... A rough sketch of how the vLLM side of a batch_size=20 run could be set up is below.
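For completeness, here is that sketch. I have not checked whether benchmark_throughput.py exposes a batch-size flag for the vLLM backend, so this goes through the Python API, and the Llama model path is an assumption; the HF side can simply pass --hf-max-batch-size 20.

```python
# Sketch only: cap vLLM's concurrent batch at 20 sequences via max_num_seqs,
# an engine argument forwarded by the LLM constructor.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",  # assumed HF path for "llama-7b-chat-hf"
    max_num_seqs=20,                        # schedule at most 20 sequences per step
)
sampling = SamplingParams(temperature=1.0, max_tokens=128, ignore_eos=True)

prompts = ["hello " * 128] * 1000  # dummy prompts standing in for --input-len 128
outputs = llm.generate(prompts, sampling)
print(f"generated {len(outputs)} outputs")
```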

@AlexBlack2202 commented

(quoting the analysis above)

Thank you very much.
