Add vLLM to ipex-llm serving image (#10807)
* add vllm

* done

* doc work

* fix done

* temp

* add docs

* format

* add

* fix
gc-fu committed Apr 29, 2024
1 parent 1f876fd commit 2c64754
Showing 12 changed files with 819 additions and 171 deletions.
29 changes: 23 additions & 6 deletions docker/llm/serving/xpu/docker/Dockerfile
@@ -6,14 +6,31 @@ ARG https_proxy
# Disable pip's cache behavior

COPY ./ /opt/

# Install Serving Dependencies
RUN cd /llm && \
pip install --pre --upgrade ipex-llm[serving] && \
pip install transformers==4.36.2 gradio==4.19.2 && \
chmod +x /opt/
RUN cd /llm &&\
# Install ipex-llm[serving] only will update ipex_llm source code without updating
# bigdl-core-xe, which will lead to problems
apt-get update && \
apt-get install -y libfabric-dev wrk && \
pip install --pre --upgrade ipex-llm[xpu,serving] && \
pip install transformers==4.37.0 gradio==4.19.2 && \
# Install vLLM-v2 dependencies
cd /llm && \
git clone -b sycl_xpu && \
cd vllm && \
pip install -r requirements-xpu.txt && \
pip install --no-deps xformers && \
VLLM_BUILD_XPU_OPS=1 pip install --no-build-isolation -v -e . && \
pip install outlines==0.0.34 --no-deps && \
pip install interegular cloudpickle diskcache joblib lark nest-asyncio numba scipy && \
# For Qwen series models support
pip install transformers_stream_generator einops tiktoken

ADD ./ /llm/vllm-examples/
ADD ./payload-1024.lua /llm/vllm-examples/
ADD ./ /llm/vllm-examples/
ADD ./ /llm/vllm-examples/
ADD ./ /llm/fastchat-examples/

ENTRYPOINT [ "/opt/" ]
123 changes: 122 additions & 1 deletion docker/llm/serving/xpu/docker/
@@ -43,4 +43,125 @@ root@arda-arc12:/# sycl-ls
After the container is booted, you can get into the container through `docker exec`.

To run model-serving using `IPEX-LLM` as backend, you can refer to this [document](
Currently, the image provides two serving engines: the FastChat serving engine and the vLLM serving engine.

#### FastChat serving engine

To run model-serving using `IPEX-LLM` as backend using FastChat, you can refer to this [quickstart](

For convenience, we have included a file `/llm/fastchat-examples/` in the image.

You can modify this script to use FastChat with either `ipex_llm_worker` or `vllm_worker`.

#### vLLM serving engine

To run vLLM engine using `IPEX-LLM` as backend, you can refer to this [document](

We have included multiple example files in `/llm/vllm-examples`:
1. ``: Used for offline inference example
2. ``: Used for benchmarking throughput
3. `payload-1024.lua`: Used for testing requests per second using 1k-128 requests
4. ``: Used as a template for starting the vLLM service

##### Online benchmark through api_server

We can benchmark the api_server to get an estimate of TPS (transactions per second). To do so, you need to start the service first according to the instructions in this [section](
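Because the benchmark only makes sense once the service is accepting requests, it can help to wait for it explicitly. The following is a minimal sketch, not part of the image: `wait_until` is a hypothetical helper that retries any command a fixed number of times. The probe in the usage comment assumes the service listens on `localhost:8000` (the port used by the `wrk` command in this section) and exposes the OpenAI-compatible `/v1/models` route.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_until TRIES DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
    tries=$1; delay=$2; shift 2
    while [ "$tries" -gt 0 ]; do
        # Run the probe quietly; success ends the wait.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}

# Hypothetical usage (assumes curl is available in the container):
# wait_until 60 5 curl -sf http://localhost:8000/v1/models || echo "service not ready" >&2
```

Passing the probe as arguments keeps the helper independent of any particular endpoint, so the same loop works if the service is started on another port.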

In the container, do the following:
1. Modify `/llm/vllm-examples/payload-1024.lua` so that the "model" attribute is correct. By default, we use a prompt that is roughly 1024 tokens long; you can change it if needed.
2. Start the benchmark with `wrk` using the script below:

cd /llm/vllm-examples
# You can change -t and -c to control the concurrency.
# By default, we use 12 connections to benchmark the service.
wrk -t12 -c12 -d15m -s payload-1024.lua http://localhost:8000/v1/completions --timeout 1h
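`wrk` ends its report with a `Requests/sec:` line, which is the TPS figure this benchmark is after. As a small sketch, the hypothetical helper `extract_rps` below pulls that number out of a saved report; the file name `wrk.log` in the usage comment is an assumption, not something the image creates.

```shell
# Print the "Requests/sec" value from a wrk report read on stdin.
extract_rps() {
    awk '/^Requests\/sec:/ { print $2 }'
}

# Hypothetical usage: keep the full report and extract the headline number.
# wrk -t12 -c12 -d15m -s payload-1024.lua http://localhost:8000/v1/completions --timeout 1h | tee wrk.log
# extract_rps < wrk.log
```

Saving the whole report rather than only the number keeps the latency statistics available for later comparison between runs.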

##### Offline benchmark through

We have included the benchmark_throughput script provided by `vllm` in our image as `/llm/`. To use it, first download the test dataset through:


The full example looks like this:
cd /llm/vllm-examples



# You can set load-in-low-bit to one of [sym_int4, fp8, fp16]

python3 /llm/vllm-examples/ \
--backend vllm \
--dataset /llm/vllm-examples/ShareGPT_V3_unfiltered_cleaned_split.json \
--model "$MODEL" \
--num-prompts 1000 \
--seed 42 \
--trust-remote-code \
--enforce-eager \
--dtype float16 \
--device xpu \
--load-in-low-bit sym_int4 \
--gpu-memory-utilization 0.85

> Note: you can adjust `--load-in-low-bit` to use other low-bit quantization formats.
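Since a full run over 1000 prompts takes a while, a quick check of the inputs before launching can save a wasted run. The sketch below is an illustration only: `preflight` is a hypothetical helper, and the accepted formats are just the three named in the comment of the example above (`sym_int4`, `fp8`, `fp16`); the backend may accept others.

```shell
# Check the benchmark inputs; print the first problem and return non-zero.
# Usage: preflight MODEL_PATH DATASET_PATH LOW_BIT
preflight() {
    model=$1; dataset=$2; low_bit=$3
    [ -n "$model" ] || { echo "MODEL is empty" >&2; return 1; }
    [ -f "$dataset" ] || { echo "dataset not found: $dataset" >&2; return 1; }
    case "$low_bit" in
        sym_int4|fp8|fp16) ;;   # formats listed in the example's comment
        *) echo "unexpected low-bit format: $low_bit" >&2; return 1 ;;
    esac
    return 0
}

# Hypothetical usage before starting the benchmark:
# preflight "$MODEL" /llm/vllm-examples/ShareGPT_V3_unfiltered_cleaned_split.json sym_int4 || exit 1
```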

You can also sweep the `--gpu-memory-utilization` rate with the following script to find the best-performing setting:


# Define the log directory
# Check if the log directory exists, if not, create it
if [ ! -d "$LOG_DIR" ]; then
    mkdir -p "$LOG_DIR"
fi

# Define an array of model paths

# Define an array of utilization rates
UTIL_RATES=(0.85 0.90 0.95)

# Loop over each model
for MODEL in "${MODELS[@]}"; do
# Loop over each utilization rate
for RATE in "${UTIL_RATES[@]}"; do
# Extract a simple model name from the path for easier identification
MODEL_NAME=$(basename "$MODEL")

# Define the log file name based on the model and rate

# Execute the command and redirect output to the log file
# Sometimes you might need to set --max-model-len if memory is not enough
# load-in-low-bit accepts inputs [sym_int4, fp8, fp16]
python3 /llm/vllm-examples/ \
--backend vllm \
--dataset /llm/vllm-examples/ShareGPT_V3_unfiltered_cleaned_split.json \
--model "$MODEL" \
--num-prompts 1000 \
--seed 42 \
--trust-remote-code \
--enforce-eager \
--dtype float16 \
--load-in-low-bit sym_int4 \
--device xpu \
--gpu-memory-utilization $RATE &> "$LOG_FILE"
done
done

# Inform the user that the script has completed its execution
echo "All benchmarks have been executed and logged."
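Once the sweep has finished, the logs still have to be compared by hand. The sketch below ranks them automatically under two assumptions that are not confirmed by this commit: each log file ends in `.log`, and the benchmark script prints a summary line of the form `Throughput: <N> requests/s, <M> tokens/s`. `rank_runs` is a hypothetical helper name.

```shell
# Print "<tokens/s> <log file>" for each log in directory $1, best first.
# Logs without a "Throughput:" line (for example failed runs) are skipped.
rank_runs() {
    for f in "$1"/*.log; do
        [ -f "$f" ] || continue
        # Field 4 of the assumed summary line is the tokens/s value.
        tps=$(awk '/^Throughput:/ { v = $4 } END { if (v != "") print v }' "$f")
        if [ -n "$tps" ]; then
            echo "$tps $f"
        fi
    done | LC_ALL=C sort -rn
}

# Hypothetical usage after the sweep:
# rank_runs "$LOG_DIR" | head -n 1
```

Skipping logs without a summary line means a run that failed, for instance from running out of memory at a high utilization rate, simply drops out of the ranking instead of breaking it.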
