
Converting engine file from onnx file with ReduceMax failure of TensorRT 8.5.10 when running trtexec on GPU Orin #3866

Open
JYS997760473 opened this issue May 15, 2024 · 7 comments
Labels
triaged Issue has been triaged by maintainers

Comments

@JYS997760473

JYS997760473 commented May 15, 2024

Description

I tried to generate an engine file from an ONNX file on the Orin GPU, but it failed:
[05/15/2024-11:45:16] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +4, now: CPU 0, GPU 4 (MiB)
[05/15/2024-11:45:16] [E] Saving engine to file failed.
[05/15/2024-11:45:16] [E] Engine set up failed

Environment

TensorRT Version:

NVIDIA GPU:

NVIDIA Driver Version:

CUDA Version:

CUDNN Version:

Operating System:

Python Version (if applicable):

Tensorflow Version (if applicable):

PyTorch Version (if applicable):

Baremetal or Container (if so, version):

Relevant Files

Model link:

Steps To Reproduce

Commands or scripts:

Have you tried the latest release?:

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):
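For reference, a minimal sketch of such a check with ONNXRuntime (the file name and input shape here are placeholders, not the actual model):

```python
# Minimal ONNXRuntime sanity check for the exported model.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
inp = sess.get_inputs()[0]
x = np.random.rand(12, 20, 12).astype(np.float32)  # adjust to the model's real input shape
outputs = sess.run(None, {inp.name: x})
for meta, out in zip(sess.get_outputs(), outputs):
    print(meta.name, out.shape)
```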

@lix19937

Please add --verbose to get a more detailed log.

@JYS997760473
Author

> Please add --verbose to get a more detailed log.

Hi, I replaced the original nn.LayerNorm block with an nn.BatchNorm block. My new network's ONNX graph is:
[screenshot of the new ONNX graph]
According to the documentation at https://github.com/NVIDIA/Deep-Learning-Accelerator-SW/tree/main/operators, the BatchNormalization operator is natively supported by NVIDIA DLA, but when I try to generate an engine file from this ONNX model, it still fails (a sketch of the swap is at the end of this comment). The end of the build log is:

[05/15/2024-20:44:50] [V] [TRT] Layer: MaxPool_5 Host Persistent: 1408 Device Persistent: 0 Scratch Memory: 0
[05/15/2024-20:44:50] [V] [TRT] Layer: Gemm_12 Host Persistent: 6752 Device Persistent: 0 Scratch Memory: 0
[05/15/2024-20:44:50] [V] [TRT] Layer: Gemm_13 || Gemm_14 Host Persistent: 5664 Device Persistent: 0 Scratch Memory: 0
[05/15/2024-20:44:50] [V] [TRT] Layer: Gemm_15 Host Persistent: 6752 Device Persistent: 0 Scratch Memory: 0
[05/15/2024-20:44:50] [V] [TRT] Layer: PWN(onnx::Div_41 + (Unnamed Layer* 33) [Shuffle], Div_17) Host Persistent: 244 Device Persistent: 0 Scratch Memory: 0
[05/15/2024-20:44:50] [V] [TRT] Layer: Gemm_19 Host Persistent: 6048 Device Persistent: 0 Scratch Memory: 0
[05/15/2024-20:44:50] [V] [TRT] Layer: Gemm_20 Host Persistent: 6048 Device Persistent: 0 Scratch Memory: 0
[05/15/2024-20:44:50] [V] [TRT] Layer: Gemm_21 Host Persistent: 6048 Device Persistent: 0 Scratch Memory: 0
[05/15/2024-20:44:50] [V] [TRT] Skipped printing memory information for 22 layers with 0 memory size i.e. Host Persistent + Device Persistent + Scratch Memory == 0.
[05/15/2024-20:44:50] [I] [TRT] Total Host Persistent Memory: 45280
[05/15/2024-20:44:50] [I] [TRT] Total Device Persistent Memory: 0
[05/15/2024-20:44:50] [I] [TRT] Total Scratch Memory: 0
[05/15/2024-20:44:50] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 132 MiB
[05/15/2024-20:44:50] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 29 steps to complete.
[05/15/2024-20:44:50] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.337024ms to assign 7 blocks to 29 nodes requiring 126464 bytes.
[05/15/2024-20:44:50] [V] [TRT] Total number of blocks in optimized block assignment: 7
[05/15/2024-20:44:50] [I] [TRT] Total Activation Memory: 126464
[05/15/2024-20:44:50] [V] [TRT] Finalize: MatMul_0 Set kernel index: 0
[05/15/2024-20:44:50] [V] [TRT] Finalize: MaxPool_5 Set kernel index: 1
[05/15/2024-20:44:50] [V] [TRT] Finalize: Gemm_12 Set kernel index: 2
[05/15/2024-20:44:50] [V] [TRT] Finalize: Gemm_13 || Gemm_14 Set kernel index: 3
[05/15/2024-20:44:50] [V] [TRT] Finalize: Gemm_15 Set kernel index: 2
[05/15/2024-20:44:50] [V] [TRT] Finalize: PWN(onnx::Div_41 + (Unnamed Layer* 33) [Shuffle], Div_17) Set kernel index: 4
[05/15/2024-20:44:50] [V] [TRT] Finalize: Gemm_19 Set kernel index: 5
[05/15/2024-20:44:50] [V] [TRT] Finalize: Gemm_20 Set kernel index: 6
[05/15/2024-20:44:50] [V] [TRT] Finalize: Gemm_21 Set kernel index: 6
[05/15/2024-20:44:50] [V] [TRT] Total number of generated kernels selected for the engine: 7
[05/15/2024-20:44:50] [V] [TRT] Kernel: 0 CASK_STATIC
[05/15/2024-20:44:50] [V] [TRT] Kernel: 1 CASK_STATIC
[05/15/2024-20:44:50] [V] [TRT] Kernel: 2 CASK_STATIC
[05/15/2024-20:44:50] [V] [TRT] Kernel: 3 CASK_STATIC
[05/15/2024-20:44:50] [V] [TRT] Kernel: 4 TRT_SERIALIZABLE:generatedNativePointwise
[05/15/2024-20:44:50] [V] [TRT] Kernel: 5 CASK_STATIC
[05/15/2024-20:44:50] [V] [TRT] Kernel: 6 CASK_STATIC
[05/15/2024-20:44:50] [V] [TRT] Disabling unused tactic source: CUDNN
[05/15/2024-20:44:50] [V] [TRT] Disabling unused tactic source: CUBLAS, CUBLAS_LT
[05/15/2024-20:44:50] [V] [TRT] Disabling unused tactic source: EDGE_MASK_CONVOLUTIONS
[05/15/2024-20:44:50] [V] [TRT] Disabling unused tactic source: JIT_CONVOLUTIONS
[05/15/2024-20:44:50] [V] [TRT] Engine generation completed in 10.7422 seconds.
[05/15/2024-20:44:50] [V] [TRT] Deleting timing cache: 141 entries, served 42 hits since creation.
[05/15/2024-20:44:50] [V] [TRT] Engine Layer Information:
Layer(NoOp): reshape_before_MatMul_0, Tactic: 0x0000000000000000, x (Float[12,20,12]) -> reshape_before_MatMul_0_out_tensor (Float[240,12,1,1])
Layer(NoOp): Reformatting CopyNode for Input Tensor 0 to MatMul_0, Tactic: 0x0000000000000000, reshape_before_MatMul_0_out_tensor (Float[240,12,1,1]) -> Reformatted Input Tensor 0 to MatMul_0 (Float[240,12:4,1,1])
Layer(CaskGemmConvolution): MatMul_0, Tactic: 0x00000000000201d1, Reformatted Input Tensor 0 to MatMul_0 (Float[240,12:4,1,1]) -> MatMul_0_out_tensor (Float[240,64:4,1,1])
Layer(NoOp): Reformatting CopyNode for Input Tensor 0 to reshape_after_MatMul_0, Tactic: 0x0000000000000000, MatMul_0_out_tensor (Float[240,64:4,1,1]) -> Reformatted Input Tensor 0 to reshape_after_MatMul_0 (Float[240,64,1,1])
Layer(NoOp): reshape_after_MatMul_0, Tactic: 0x0000000000000000, Reformatted Input Tensor 0 to reshape_after_MatMul_0 (Float[240,64,1,1]) -> onnx::Add_25 (Float[12,20,64])
Layer(Constant): backbone.subgraph.linear.bias + (Unnamed Layer* 4) [Shuffle], Tactic: 0x0000000000000000,  -> (Unnamed Layer* 4) [Shuffle]_output (Float[1,1,64])
Layer(ElementWise): Add_1, Tactic: 0x0000000000000001, (Unnamed Layer* 4) [Shuffle]_output (Float[1,1,64]), onnx::Add_25 (Float[12,20,64]) -> input (Float[12,20,64])
Layer(NoOp): (Unnamed Layer* 6) [Shuffle], Tactic: 0x0000000000000000, input (Float[12,20,64]) -> (Unnamed Layer* 6) [Shuffle]_output (Float[12,20,64,1])
Layer(Scale): BatchNormalization_2 + Relu_3, Tactic: 0x0000000000000000, (Unnamed Layer* 6) [Shuffle]_output (Float[12,20,64,1]) -> Relu_3_out_tensor (Float[12,20,64,1])
Layer(NoOp): squeeze_after_Relu_3, Tactic: 0x0000000000000000, Relu_3_out_tensor (Float[12,20,64,1]) -> squeeze_after_Relu_3_out_tensor (Float[12,20,64])
Layer(Shuffle): Transpose_4 + (Unnamed Layer* 11) [Shuffle], Tactic: 0x0000000000000000, squeeze_after_Relu_3_out_tensor (Float[12,20,64]) -> (Unnamed Layer* 11) [Shuffle]_output (Float[12,64,20,1])
Layer(CaskPooling): MaxPool_5, Tactic: 0x5faf4a0a8a5670ed, (Unnamed Layer* 11) [Shuffle]_output (Float[12,64,20,1]) -> (Unnamed Layer* 12) [Pooling]_output (Float[12,64,1,1])
Layer(NoOp): (Unnamed Layer* 13) [Shuffle] + Squeeze_6, Tactic: 0x0000000000000000, (Unnamed Layer* 12) [Pooling]_output (Float[12,64,1,1]) -> x.1 (Float[12,64])
Layer(Reformat): reshape_before_Gemm_12_copy_input, Tactic: 0x00000000000003e8, x.1 (Float[1,64]) -> reshape_before_Gemm_12_copy_input (Float[1,64])
Layer(NoOp): reshape_before_Gemm_12, Tactic: 0x0000000000000000, reshape_before_Gemm_12_copy_input (Float[1,64]) -> reshape_before_Gemm_12_out_tensor (Float[1,64,1,1])
Layer(CaskGemmConvolution): Gemm_12, Tactic: 0x000000000002034f, reshape_before_Gemm_12_out_tensor (Float[1,64,1,1]) -> Gemm_12_out_tensor (Float[1,32,1,1])
Layer(NoOp): reshape_after_Gemm_12, Tactic: 0x0000000000000000, Gemm_12_out_tensor (Float[1,32,1,1]) -> onnx::Gemm_37 (Float[1,32])
Layer(NoOp): reshape_before_Gemm_13, Tactic: 0x0000000000000000, x.1 (Float[12,64]) -> reshape_before_Gemm_13_out_tensor (Float[12,64,1,1])
Layer(CaskGemmConvolution): Gemm_13 || Gemm_14, Tactic: 0x00000000000204df, reshape_before_Gemm_13_out_tensor (Float[12,64,1,1]) -> Gemm_13 || Gemm_14 (Float[12,64,1,1])
Layer(Reformat): reshape_after_Gemm_13_copy_input, Tactic: 0x00000000000003e8, Gemm_13 || Gemm_14 (Float[12,32,1,1]) -> reshape_after_Gemm_13_copy_input (Float[12,32,1,1])
Layer(NoOp): reshape_after_Gemm_13, Tactic: 0x0000000000000000, reshape_after_Gemm_13_copy_input (Float[12,32,1,1]) -> onnx::Gemm_38 (Float[12,32])
Layer(Reformat): reshape_after_Gemm_14_copy_input, Tactic: 0x00000000000003e8, Gemm_13 || Gemm_14 (Float[12,32,1,1]) -> reshape_after_Gemm_14_copy_input (Float[12,32,1,1])
Layer(NoOp): reshape_after_Gemm_14, Tactic: 0x0000000000000000, reshape_after_Gemm_14_copy_input (Float[12,32,1,1]) -> onnx::Gemm_39 (Float[12,32])
Layer(CaskGemmMatrixMultiply): Gemm_15, Tactic: 0x000000000002034f, onnx::Gemm_37 (Float[1,32]), onnx::Gemm_38 (Float[12,32]) -> onnx::Div_40 (Float[1,12])
Layer(PointWiseV2): PWN(onnx::Div_41 + (Unnamed Layer* 33) [Shuffle], Div_17), Tactic: 0x000000000000001c, onnx::Div_40 (Float[1,12]) -> scores (Float[1,12])
Layer(CudaSoftMax): Softmax_18, Tactic: 0x00000000000003e9, scores (Float[1,12]) -> (Unnamed Layer* 36) [Softmax]_output (Float[1,12])
Layer(CaskGemmMatrixMultiply): Gemm_19, Tactic: 0x00000000000203be, (Unnamed Layer* 36) [Softmax]_output (Float[1,12]), onnx::Gemm_39 (Float[12,32]) -> onnx::Gemm_44 (Float[1,32])
Layer(NoOp): reshape_before_Gemm_20, Tactic: 0x0000000000000000, onnx::Gemm_44 (Float[1,32]) -> reshape_before_Gemm_20_out_tensor (Float[1,32,1,1])
Layer(CaskGemmConvolution): Gemm_20, Tactic: 0x000000000002014b, reshape_before_Gemm_20_out_tensor (Float[1,32,1,1]) -> Gemm_20_out_tensor (Float[1,32,1,1])
Layer(CaskGemmConvolution): Gemm_21, Tactic: 0x000000000002014b, Gemm_20_out_tensor (Float[1,32,1,1]) -> Gemm_21_out_tensor (Float[1,30,1,1])
Layer(NoOp): reshape_after_Gemm_21, Tactic: 0x0000000000000000, Gemm_21_out_tensor (Float[1,30,1,1]) -> reg (Float[1,30])
[05/15/2024-20:44:50] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +4, now: CPU 0, GPU 4 (MiB)
[05/15/2024-20:44:50] [E] Saving engine to file failed.
[05/15/2024-20:44:50] [E] Engine set up failed

Please check and have a nice day
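
For reference, a minimal sketch of the kind of swap described above; the module layout, names, and dimensions here are hypothetical, not the actual network:

```python
import torch
import torch.nn as nn

# Hypothetical block illustrating the swap: replace nn.LayerNorm over the
# last dimension with nn.BatchNorm1d over the channel dimension.
class Block(nn.Module):
    def __init__(self, hidden=64, use_batchnorm=True):
        super().__init__()
        self.linear = nn.Linear(12, hidden)
        # BatchNorm1d expects (N, C, L), so the tensor is transposed around it.
        self.norm = nn.BatchNorm1d(hidden) if use_batchnorm else nn.LayerNorm(hidden)
        self.use_batchnorm = use_batchnorm
        self.act = nn.ReLU()

    def forward(self, x):            # x: (N, L, 12)
        x = self.linear(x)           # (N, L, hidden)
        if self.use_batchnorm:
            x = self.norm(x.transpose(1, 2)).transpose(1, 2)
        else:
            x = self.norm(x)
        return self.act(x)

torch.onnx.export(Block().eval(), torch.randn(12, 20, 12), "block.onnx", opset_version=13)
```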

@JYS997760473
Author

And if I remove the LayerNorm or BatchNormalization block, the engine file can be generated successfully.

@lix19937

lix19937 commented May 17, 2024

You can try to convert these two modules (the LayerNorm or BatchNormalization block, exported as a standalone subgraph ONNX) separately, as sketched below.
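
For example, a minimal sketch of exporting just the normalization block on its own (the feature size, input shape, and file names are placeholders):

```python
import torch
import torch.nn as nn

# Export only the normalization block so the failing layer can be isolated.
norm = nn.BatchNorm1d(64).eval()     # or nn.LayerNorm(64) for the other variant
dummy = torch.randn(12, 64, 20)      # (N, C, L) layout for BatchNorm1d
torch.onnx.export(norm, dummy, "norm_only.onnx", opset_version=13)
# Then try: trtexec --onnx=norm_only.onnx --saveEngine=norm_only.engine --verbose
```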

@zerollzeng
Collaborator

> [05/15/2024-20:44:50] [E] Saving engine to file failed.

no disk space?
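
For example, free space at the intended engine output location can be checked from Python (the path is a placeholder):

```python
import shutil

# Placeholder path: use the directory passed to trtexec --saveEngine.
total, used, free = shutil.disk_usage("/path/to/engine/output")
print(f"free: {free / 2**20:.0f} MiB of {total / 2**20:.0f} MiB")
```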

@zerollzeng zerollzeng self-assigned this May 17, 2024
@zerollzeng zerollzeng added the triaged Issue has been triaged by maintainers label May 17, 2024
@JYS997760473
Author

> [05/15/2024-20:44:50] [E] Saving engine to file failed.
>
> no disk space?

Hi, thanks for your reply. I tried again with a new .pt file and was able to create the engine file successfully.
One more thing I would like to clarify: as of now, can we not use the LayerNormalization operator on DRIVE Orin unless I write a TensorRT plugin myself?

@zerollzeng
Collaborator

Please check our release notes; I think you need at least TRT 8.6 or 9.0, I can't remember exactly which one.
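
For reference, a quick sketch to check the installed TensorRT version and probe whether the Python API exposes a native normalization layer (present only in the newer releases mentioned above):

```python
import tensorrt as trt

print("TensorRT version:", trt.__version__)

# Probe for INetworkDefinition.add_normalization, which only exists in
# releases that added native LayerNorm support.
builder = trt.Builder(trt.Logger(trt.Logger.WARNING))
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
print("native normalization layer available:", hasattr(network, "add_normalization"))
```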
