Add vLLM to ipex-llm serving image #10807

Merged
merged 9 commits on Apr 29, 2024
29 changes: 23 additions & 6 deletions docker/llm/serving/xpu/docker/Dockerfile
@@ -6,14 +6,31 @@ ARG https_proxy
# Disable pip's cache behavior
ARG PIP_NO_CACHE_DIR=false

COPY ./entrypoint.sh /opt/entrypoint.sh

# Install Serving Dependencies
RUN cd /llm && \
pip install --pre --upgrade ipex-llm[serving] && \
pip install transformers==4.36.2 gradio==4.19.2 && \
chmod +x /opt/entrypoint.sh
RUN cd /llm && \
# Installing only ipex-llm[serving] would update the ipex_llm source code without
# updating bigdl-core-xe, which leads to problems
apt-get update && \
apt-get install -y libfabric-dev wrk && \
pip install --pre --upgrade ipex-llm[xpu,serving] && \
pip install transformers==4.37.0 gradio==4.19.2 && \
# Install vLLM-v2 dependencies
cd /llm && \
git clone -b sycl_xpu https://github.com/analytics-zoo/vllm.git && \
cd vllm && \
pip install -r requirements-xpu.txt && \
pip install --no-deps xformers && \
VLLM_BUILD_XPU_OPS=1 pip install --no-build-isolation -v -e . && \
pip install outlines==0.0.34 --no-deps && \
pip install interegular cloudpickle diskcache joblib lark nest-asyncio numba scipy && \
# For Qwen series models support
pip install transformers_stream_generator einops tiktoken

ADD ./offline_inference.py /llm/vllm-examples/
ADD ./payload-1024.lua /llm/vllm-examples/
ADD ./start-vllm-service.sh /llm/vllm-examples/
ADD ./benchmark_throughput.py /llm/vllm-examples/
ADD ./start-fastchat-service.sh /llm/fastchat-examples/

WORKDIR /llm/
ENTRYPOINT [ "/opt/entrypoint.sh" ]
123 changes: 122 additions & 1 deletion docker/llm/serving/xpu/docker/README.md
@@ -43,4 +43,125 @@ root@arda-arc12:/# sycl-ls
```
After the container is booted, you can get into the container through `docker exec`.

To run model-serving using `IPEX-LLM` as backend, you can refer to this [document](https://github.com/intel-analytics/IPEX-LLM/tree/main/python/llm/src/ipex_llm/serving).
Currently, the image provides two different serving engines: the FastChat serving engine and the vLLM serving engine.

#### FastChat serving engine

To run model serving with `IPEX-LLM` as the backend through FastChat, you can refer to this [quickstart](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/fastchat_quickstart.html#).

For convenience, we have included a file `/llm/fastchat-examples/start-fastchat-service.sh` in the image.

You can modify this script to use FastChat with either the `ipex_llm_worker` or the `vllm_worker`.
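
For reference, a minimal sketch of the underlying commands is shown below. The module path and flags for the IPEX-LLM worker are assumptions based on the quickstart linked above; treat `/llm/fastchat-examples/start-fastchat-service.sh` as the authoritative reference for this image.

```bash
# Sketch of a manual FastChat bring-up inside the container.
export MODEL_PATH="YOUR_MODEL"   # placeholder: path to a locally available model

# 1. Start the FastChat controller
python3 -m fastchat.serve.controller &
sleep 3   # give the controller a moment to come up

# 2. Start a worker. The ipex_llm_worker module path below is an assumption;
#    swap in fastchat.serve.vllm_worker to use the vLLM worker instead.
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker \
  --model-path "$MODEL_PATH" --device xpu --low-bit sym_int4 &
sleep 3

# 3. Expose an OpenAI-compatible API server on port 8000
python3 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000
```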

#### vLLM serving engine

To run the vLLM engine with `IPEX-LLM` as the backend, you can refer to this [document](https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/vLLM-Serving/README.md).

We have included multiple example files in `/llm/vllm-examples`:
1. `offline_inference.py`: an offline inference example
2. `benchmark_throughput.py`: a script for benchmarking throughput
3. `payload-1024.lua`: used for testing requests per second with 1k-128 requests (a roughly 1024-token prompt and 128 generated tokens)
4. `start-vllm-service.sh`: a template for starting the vLLM service
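
As a quick smoke test of the offline path, you can run the bundled example directly (the model configured inside `offline_inference.py` may need to be edited to point at a locally available model first):

```bash
cd /llm/vllm-examples
# Inspect/edit the model path inside the script before running it.
python3 offline_inference.py
```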

##### Online benchmark through api_server

We can benchmark the api_server to get an estimate of TPS (transactions per second). To do so, you need to start the service first according to the instructions in this [section](https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/vLLM-Serving/README.md#service).
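
In practice this means editing and launching the bundled `start-vllm-service.sh` template. The rough sketch below assumes the fork exposes the standard vLLM OpenAI-compatible entrypoint and accepts the same `--device`/`--load-in-low-bit` flags as the offline benchmark further down; check `start-vllm-service.sh` for the exact invocation used in this image.

```bash
cd /llm/vllm-examples
export MODEL="YOUR_MODEL"   # placeholder: path to a locally available model

# Sketch only: the entrypoint and flags mirror the offline benchmark below;
# the authoritative command lives in start-vllm-service.sh.
python3 -m vllm.entrypoints.openai.api_server \
  --model "$MODEL" \
  --port 8000 \
  --device xpu \
  --load-in-low-bit sym_int4 \
  --trust-remote-code
```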


Inside the container, do the following:
1. Modify `/llm/vllm-examples/payload-1024.lua` so that the "model" attribute is correct. By default, the payload uses a prompt that is roughly 1024 tokens long; you can change it if needed.
2. Start the benchmark with `wrk` using the script below:

```bash
cd /llm/vllm-examples
# You can change -t and -c to control the concurrency.
# By default, we use 12 connections to benchmark the service.
wrk -t12 -c12 -d15m -s payload-1024.lua http://localhost:8000/v1/completions --timeout 1h
```
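
For reference, the payload script follows the standard `wrk` Lua scripting interface. The sketch below regenerates a shortened variant via a heredoc so the shape is visible; the shipped `payload-1024.lua` is the authoritative version with a roughly 1024-token prompt, and the model name here is a placeholder.

```bash
# Hypothetical shortened payload; the real payload-1024.lua carries a ~1024-token prompt.
cat > /llm/vllm-examples/payload-short.lua <<'EOF'
wrk.method = "POST"
wrk.headers["Content-Type"] = "application/json"
-- "model" must match the model name served by the api_server;
-- max_tokens = 128 mirrors the 1k-128 request shape described above.
wrk.body = '{"model": "YOUR_MODEL", "prompt": "San Francisco is a", "max_tokens": 128}'
EOF
```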

##### Offline benchmark through benchmark_throughput.py

We have included the benchmark_throughput script provided by `vllm` in our image as `/llm/vllm-examples/benchmark_throughput.py`. To use it, you will need to download the test dataset first:

```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

The full example looks like this:
```bash
cd /llm/vllm-examples

wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

export MODEL="YOUR_MODEL"

# You can set --load-in-low-bit to one of [sym_int4, fp8, fp16]

python3 /llm/vllm-examples/benchmark_throughput.py \
--backend vllm \
--dataset /llm/vllm-examples/ShareGPT_V3_unfiltered_cleaned_split.json \
--model $MODEL \
--num-prompts 1000 \
--seed 42 \
--trust-remote-code \
--enforce-eager \
--dtype float16 \
--device xpu \
--load-in-low-bit sym_int4 \
--gpu-memory-utilization 0.85
```

> Note: you can adjust `--load-in-low-bit` to use other low-bit quantization formats.
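
For example, to compare formats you can wrap the command above in a small loop (a sketch; the model path is a placeholder and each run writes its own log file):

```bash
cd /llm/vllm-examples
export MODEL="YOUR_MODEL"

for BIT in sym_int4 fp8 fp16; do
  python3 /llm/vllm-examples/benchmark_throughput.py \
    --backend vllm \
    --dataset /llm/vllm-examples/ShareGPT_V3_unfiltered_cleaned_split.json \
    --model "$MODEL" \
    --num-prompts 1000 \
    --seed 42 \
    --trust-remote-code \
    --enforce-eager \
    --dtype float16 \
    --device xpu \
    --load-in-low-bit "$BIT" \
    --gpu-memory-utilization 0.85 &> "throughput_${BIT}.log"
done
```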


You can also sweep the `--gpu-memory-utilization` rate with the following script to find the best-performing setting:

```bash
#!/bin/bash

# Define the log directory
LOG_DIR="YOUR_LOG_DIR"
# Check if the log directory exists, if not, create it
if [ ! -d "$LOG_DIR" ]; then
  mkdir -p "$LOG_DIR"
fi

# Define an array of model paths
MODELS=(
"YOUR TESTED MODELS"
)

# Define an array of utilization rates
UTIL_RATES=(0.85 0.90 0.95)

# Loop over each model
for MODEL in "${MODELS[@]}"; do
  # Loop over each utilization rate
  for RATE in "${UTIL_RATES[@]}"; do
    # Extract a simple model name from the path for easier identification
    MODEL_NAME=$(basename "$MODEL")

    # Define the log file name based on the model and rate
    LOG_FILE="$LOG_DIR/${MODEL_NAME}_utilization_${RATE}.log"

    # Execute the command and redirect output to the log file
    # Sometimes you might need to set --max-model-len if memory is not enough
    # --load-in-low-bit accepts one of [sym_int4, fp8, fp16]
    python3 /llm/vllm-examples/benchmark_throughput.py \
      --backend vllm \
      --dataset /llm/vllm-examples/ShareGPT_V3_unfiltered_cleaned_split.json \
      --model $MODEL \
      --num-prompts 1000 \
      --seed 42 \
      --trust-remote-code \
      --enforce-eager \
      --dtype float16 \
      --load-in-low-bit sym_int4 \
      --device xpu \
      --gpu-memory-utilization $RATE &> "$LOG_FILE"
  done
done

# Inform the user that the script has completed its execution
echo "All benchmarks have been executed and logged."
```
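
Once the sweep finishes, the headline numbers can be pulled out of the logs in one pass. The command below assumes `LOG_DIR` is set as in the script above and that `benchmark_throughput.py` prints a line containing "Throughput", as the upstream vLLM script does:

```bash
# Show each model/rate combination next to its reported throughput line.
grep -H "Throughput" "$LOG_DIR"/*.log
```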