Testing TensorRT-LLM backend

The tests in this CI directory can be run manually for more extensive testing.

Run QA Tests

Before the Triton 23.10 release, you can launch the Triton 23.09 container nvcr.io/nvidia/tritonserver:23.09-py3 and add the directory /opt/tritonserver/backends/tensorrtllm within the container following the instructions in Option 3 Build via CMake.

Run the tests within the Triton container.

docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/opt/tritonserver/tensorrtllm_backend nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 bash

# Change directory to the test and run the test.sh script
cd /opt/tritonserver/tensorrtllm_backend/ci/<test directory>
bash -x ./test.sh

Run the e2e/benchmark_core_model to benchmark

These two tests are run in the L0_backend_trtllm test. The instructions below describe how to run them manually.

Generate the model repository

Follow the instructions in the Create the model repository section to prepare the model repository.

Modify the model configuration

Follow the instructions in the Modify the model configuration section to adjust the model configuration to your needs.

End to end test

The end-to-end test script sends requests to the deployed ensemble model.

The ensemble model is composed of three models: preprocessing, tensorrt_llm and postprocessing:

"preprocessing": This model is used for tokenizing, meaning the conversion from prompts (string) to input_ids (list of ints).
"tensorrt_llm": This model is a wrapper of your TensorRT-LLM model and is used for inferencing.
"postprocessing": This model is used for de-tokenizing, meaning the conversion from output_ids (list of ints) to outputs (string).

The end-to-end latency is the total latency across all three parts of the ensemble model.
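
As a rough illustration of how the three stages chain together and why the end-to-end latency is the sum of their latencies, here is a minimal Python sketch. The functions, the toy vocabulary and the placeholder "inference" step are all hypothetical stand-ins; they do not call Triton or TensorRT-LLM and only mirror the data flow described above.

```python
import time

# Hypothetical toy vocabulary; a real deployment uses the model's tokenizer.
VOCAB = {"hello": 1, "world": 2}
INV_VOCAB = {token_id: word for word, token_id in VOCAB.items()}


def preprocessing(prompt):
    """Tokenize: prompt (string) -> input_ids (list of ints)."""
    return [VOCAB[word] for word in prompt.split()]


def tensorrt_llm(input_ids):
    """Placeholder for inference: reverses the ids instead of generating."""
    return list(reversed(input_ids))


def postprocessing(output_ids):
    """De-tokenize: output_ids (list of ints) -> output (string)."""
    return " ".join(INV_VOCAB[token_id] for token_id in output_ids)


def run_ensemble(prompt):
    """Run the three stages in order and record each stage's latency in ms."""
    stages = (
        ("preprocessing", preprocessing),
        ("tensorrt_llm", tensorrt_llm),
        ("postprocessing", postprocessing),
    )
    timings = {}
    data = prompt
    for name, stage in stages:
        start = time.perf_counter()
        data = stage(data)
        timings[name] = (time.perf_counter() - start) * 1000.0
    return data, timings


text, timings = run_ensemble("hello world")
end_to_end_ms = sum(timings.values())  # end-to-end latency covers all three stages
print(text, timings, end_to_end_ms)
```

In this sketch only the tensorrt_llm entry of the timings corresponds to what benchmark_core_model measures; the other two entries are the pre/post-processing share.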

cd tools/inflight_batcher_llm
python3 end_to_end_test.py --dataset <dataset path>

Expected outputs

[INFO] Functionality test succeed.
[INFO] Warm up for benchmarking.
[INFO] Start benchmarking on 125 prompts.
[INFO] Total Latency: 11099.243 ms
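
If you want to compare runs, a small script can pull the numbers out of this output. The following is a sketch only: it assumes the log lines keep exactly the format shown above, and the helper name summarize is made up for this example. Note that total latency divided by the number of prompts is an average share per prompt, which is not necessarily the latency of an individual request if requests overlap.

```python
import re

# Reference output copied from above; replace with the captured script output.
LOG = """\
[INFO] Functionality test succeed.
[INFO] Warm up for benchmarking.
[INFO] Start benchmarking on 125 prompts.
[INFO] Total Latency: 11099.243 ms
"""


def summarize(log):
    """Return (prompt count, total latency in ms, total latency / prompt count)."""
    prompts = int(re.search(r"Start benchmarking on (\d+) prompts", log).group(1))
    total_ms = float(re.search(r"Total Latency: ([\d.]+) ms", log).group(1))
    return prompts, total_ms, total_ms / prompts


prompts, total_ms, avg_ms = summarize(LOG)
print(f"{prompts} prompts, {total_ms:.3f} ms total, {avg_ms:.2f} ms per prompt")
```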

benchmark_core_model

The benchmark_core_model script sends requests directly to the deployed tensorrt_llm model. Its latency therefore reflects the inference latency of TensorRT-LLM alone and excludes the pre/post-processing latency, which is usually handled by a third-party library such as HuggingFace.

cd tools/inflight_batcher_llm
python3 benchmark_core_model.py dataset --dataset <dataset path>

Expected outputs

[INFO] Warm up for benchmarking.
[INFO] Start benchmarking on 125 prompts.
[INFO] Total Latency: 10213.462 ms

Please note that the expected outputs in this document are only for reference. Specific performance numbers depend on the GPU you're using.
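
Because the end-to-end test includes pre/post-processing and benchmark_core_model does not, the gap between the two total latencies gives a rough estimate of the pre/post-processing cost. The sketch below uses the two reference numbers shown in this document; it assumes both runs used the same dataset and hardware, and since the numbers come from separate runs, the result is only indicative.

```python
# Reference totals from the expected outputs in this document (milliseconds).
END_TO_END_MS = 11099.243  # end_to_end_test.py: preprocessing + inference + postprocessing
CORE_MODEL_MS = 10213.462  # benchmark_core_model.py: tensorrt_llm model only

# Rough estimate of the time spent outside the tensorrt_llm model.
overhead_ms = END_TO_END_MS - CORE_MODEL_MS
overhead_share = overhead_ms / END_TO_END_MS

print(f"pre/post-processing estimate: {overhead_ms:.3f} ms "
      f"({overhead_share:.1%} of end-to-end latency)")
```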
