How to conduct INT8 quantization and calibration in Python? #3858
Comments
By the way, I use TensorRT 8.4.1; does the calibration API in Python not work? Hoping someone can give some help! Many thanks. [05/15/2024-15:50:14] [TRT] [I] Starting Calibration.
You can make use of Polygraphy; see https://github.com/NVIDIA/TensorRT/tree/main/tools/Polygraphy/examples/cli/convert/01_int8_calibration_in_tensorrt
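For reference, the CLI workflow in that example looks roughly like this (the data-loader script, cache file, and output name below are placeholders, not part of the original reply):

```bash
# Build an INT8 engine with calibration driven by a user-supplied data loader.
polygraphy convert ./onnx_model/model.onnx --int8 \
    --data-loader-script ./data_loader.py \
    --calibration-cache calib.cache \
    -o model_int8.engine
```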
@zerollzeng Thanks a lot for the support. By the way, can the engine built with the EngineFromNetwork API be saved to disk, so I get the new quantized TensorRT engine file? I use the following code to generate the TensorRT engine:
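A minimal sketch of such a script with Polygraphy's Python API (not the original snippet; the data loader, input names, and shapes are assumptions):

```python
import numpy as np
from polygraphy.backend.trt import (
    Calibrator, CreateConfig, EngineFromNetwork, NetworkFromOnnxPath, SaveEngine,
)

def calib_data(n_batches=8):
    # Placeholder calibration data -- replace with real preprocessed samples.
    for _ in range(n_batches):
        yield {"xs": np.random.rand(1, 500, 80).astype(np.float32),
               "xlen": np.array([500], dtype=np.int32)}

calibrator = Calibrator(data_loader=calib_data(), cache="calib.cache")

build_engine = EngineFromNetwork(
    NetworkFromOnnxPath("./onnx_model/model.onnx"),
    config=CreateConfig(int8=True, calibrator=calibrator),
)

# SaveEngine wraps the loader so the serialized engine is written to disk
# as soon as the engine is built.
build_and_save = SaveEngine(build_engine, path="model_int8.engine")
engine = build_and_save()  # builds, saves model_int8.engine, returns the engine
```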
But why is the new engine file size not 1/4 of the old one generated with float32? It went from 156M to 95M, and the ONNX file size is 153M. What's wrong? Is the engine-saving code right?
@zerollzeng Sorry to bother you again. I tried to use the trtexec tool to generate an INT8 quantized engine without calibration, like this:
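A command along these lines (the shape values and output name are placeholders for illustration, not the exact command from the original post):

```bash
# INT8 build without a calibrator; dynamic inputs need explicit shape ranges.
trtexec --onnx=./onnx_model/model.onnx --int8 \
        --minShapes=xs:1x1x80,xlen:1 \
        --optShapes=xs:1x500x80,xlen:1 \
        --maxShapes=xs:1x2000x80,xlen:1 \
        --saveEngine=model_int8.engine
```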
The quantized TensorRT engine size becomes 51M. Why is it much smaller than the engine generated with Polygraphy? Is it because the latter contains Q/DQ layers? Also, I tested the inference speed of the FP32 engine and the INT8 engine and they are almost the same. What's wrong? I tested them on an A100 GPU.
Many factors affect the final engine size; I don't have a clear conclusion in your case.
My guess is sub-optimal Q/DQ placement. You can check the engine layer information in the verbose log, or check the layer profile (see trtexec -h), to confirm. You can take PTQ as the best-perf baseline: use the model without Q/DQ and build with --best to see how good the perf can be, as in the sketch below.
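A sketch of that check (file names are placeholders):

```bash
# Let TensorRT pick the fastest precision per layer (--best), and dump
# per-layer information plus the layer profile for inspection.
trtexec --onnx=./onnx_model/model.onnx --best \
        --profilingVerbosity=detailed \
        --dumpLayerInfo --exportLayerInfo=layers.json \
        --dumpProfile --exportProfile=profile.json \
        --saveEngine=model_best.engine
```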
Hi all, I'm trying to convert an ONNX model to TensorRT with INT8 quantization in a Python environment; here is the code:
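A minimal sketch of such a script with the TensorRT 8.x Python API (not the original code; the calibration data, input dtypes, and shape ranges below are assumptions):

```python
import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

def calib_data(n_batches=8):
    # Placeholder calibration data -- replace with real preprocessed samples.
    # Shapes and dtypes here are assumptions; they must match the calibration profile.
    for _ in range(n_batches):
        yield {"xs": np.random.rand(1, 500, 80).astype(np.float32),
               "xlen": np.array([500], dtype=np.int32)}

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, batches, cache_file="calib.cache"):
        super().__init__()
        self.batches = iter(batches)
        self.cache_file = cache_file
        self.buffers = {}  # per-input device buffers

    def get_batch_size(self):
        return 1

    def get_batch(self, names):
        try:
            batch = next(self.batches)
        except StopIteration:
            return None  # signals the end of calibration data
        ptrs = []
        for name in names:
            arr = np.ascontiguousarray(batch[name])
            if name not in self.buffers:
                self.buffers[name] = cuda.mem_alloc(arr.nbytes)
            cuda.memcpy_htod(self.buffers[name], arr)
            # get_batch must return plain int device pointers, one per input name.
            ptrs.append(int(self.buffers[name]))
        return ptrs

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)
with open("./onnx_model/model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
config.int8_calibrator = EntropyCalibrator(calib_data())

# Dynamic-shape inputs need an optimization profile; registering it as the
# calibration profile avoids the "Calibration Profile is not defined" warning.
profile = builder.create_optimization_profile()
profile.set_shape("xs", min=(1, 1, 80), opt=(1, 500, 80), max=(1, 2000, 80))  # assumed dims
profile.set_shape("xlen", min=(1,), opt=(1,), max=(1,))
config.add_optimization_profile(profile)
config.set_calibration_profile(profile)

# build_serialized_network supersedes the deprecated build_engine.
serialized = builder.build_serialized_network(network, config)
with open("model_int8.plan", "wb") as f:
    f.write(serialized)
```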
The model has two input tensors ("xs" and "xlen") with dynamic input shapes. When I run this script, it always gives the following error:
[05/13/2024-17:39:19] [TRT] [W] parsers/onnx/onnx2trt_utils.cpp:367: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
Building an engine from file ./onnx_model/model.onnx, this may take a while...
quantilize.py:173: DeprecationWarning: Use build_serialized_network instead.
engine = builder.build_engine(network, config)
[05/13/2024-17:39:26] [TRT] [W] Calibration Profile is not defined. Running calibration with Profile 0
[05/13/2024-17:39:26] [TRT] [W] Calibration Profile is not defined. Running calibration with Profile 0
[ERROR] Exception caught in get_batch(): Unable to cast Python instance to C++ type (compile in debug mode for details)
[05/13/2024-17:39:44] [TRT] [E] 1: Unexpected exception _Map_base::at
Failed to create the engine
What's wrong? Is there any error in my code? How can I fix this error and successfully finish this job? Can anyone give some help? Thanks a lot in advance!