
Improving performance of video inference and increasing GPU utilization #178

Open · h9419 wants to merge 8 commits into master

Conversation


h9419 commented Apr 6, 2022

Three improvements are made in this contribution:

  1. Removed repeated copying of the background image to GPU memory, minimizing the effect of a memory bandwidth bottleneck
  2. Increased GPU utilization by offloading the CPU video encoding to child threads as soon as each frame is copied to CPU memory, freeing the main process to begin processing the next frame
  3. Further increased GPU utilization by offloading the CPU video decoding to another thread so that the main thread can focus on feeding the GPU

This modification allowed for about three times the performance on my system with an R7 5800H and an RTX 3060 mobile. Using the same 4K video on both the resnet50 and resnet101 models, the original version ran at 2.20 it/s whereas this runs at an average of 7.5 it/s.
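
A minimal sketch of the first two changes, assuming an inference loop similar to the repository's inference_video.py. The names model, frames, bgr_frame, and writer are illustrative stand-ins rather than the script's exact identifiers, and a single encoder thread is shown where the PR uses several:

```python
import threading
from queue import Queue

import torch

def run_inference(model, frames, bgr_frame, writer):
    """Sketch: cache the background on the GPU and encode on a worker thread."""
    # 1. Copy the background to GPU memory once, outside the frame loop,
    #    instead of re-uploading it for every frame.
    bgr = bgr_frame.cuda(non_blocking=True)

    # 2. Hand finished frames to a worker thread for CPU video encoding so the
    #    main thread can immediately start the next GPU batch.
    write_queue = Queue(maxsize=4)

    def writer_worker():
        while True:
            frame = write_queue.get()
            if frame is None:          # sentinel: no more frames
                break
            writer.write(frame)        # CPU-side encoding happens here

    encoder = threading.Thread(target=writer_worker, daemon=True)
    encoder.start()

    with torch.no_grad():
        for src in frames:             # frames produced by the DataLoader
            src = src.cuda(non_blocking=True)
            pha, fgr = model(src, bgr)[:2]                      # matting forward pass
            com = fgr * pha                                     # simplified composite
            write_queue.put(com.mul(255).byte().cpu().numpy())

    write_queue.put(None)              # tell the worker to finish
    encoder.join()
```

The point is that the background tensor is uploaded once, and the encoder's write call never blocks the loop that feeds the GPU.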

h9419 and others added 8 commits April 6, 2022 21:47
…and threading CPU video encoding

Two improvements are made in this fork:

1. Removed repeated copying of background to GPU memory
2. Minimized idle GPU time by passing video encoding work to child threads as soon as each frame is copied to CPU memory, allowing for higher GPU utilization.
Breaks the loop instead of exiting directly
Fixed: replaced type(int) with (int)
I found that the CPU time spent in the DataLoader accounts for another 30-40% of the execution time. I added a thread for loading data and reserved the main thread for controlling the GPU.
Added if __name__ == '__main__' so that Windows recognizes Process
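
A rough sketch of the decode-side change from the last two commits: a prefetch thread iterates the DataLoader and feeds a bounded queue, and the if __name__ == '__main__' guard keeps Windows' spawn-based process creation happy. Here loader and process_on_gpu are illustrative placeholders for the existing VideoDataset DataLoader and the GPU inference step:

```python
import threading
from queue import Queue

def decode_worker(loader, frame_queue):
    # All DataLoader iteration (i.e. video decoding) happens in this thread.
    for batch in loader:
        frame_queue.put(batch)
    frame_queue.put(None)              # sentinel: end of video

def run(loader, process_on_gpu):
    # Bounded queue so decoding cannot run far ahead of the GPU.
    frame_queue = Queue(maxsize=8)
    threading.Thread(target=decode_worker, args=(loader, frame_queue),
                     daemon=True).start()
    while True:
        batch = frame_queue.get()
        if batch is None:
            break
        process_on_gpu(batch)          # main thread stays focused on feeding the GPU

if __name__ == '__main__':             # required on Windows, which spawns new processes
    run(loader=[], process_on_gpu=print)   # stand-ins; pass the real DataLoader and GPU step
```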

h9419 commented Apr 17, 2022

Although this is faster, one major bottleneck is still in VideoDataset. When inferring on a 4k HEVC video, around 80% of the execution time is spent on VideoDataset decode. Future work can focus on using NVDEC or other GPU-accelerated video loaders.


h9419 commented Jan 12, 2023

> Although this is faster, one major bottleneck is still in VideoDataset. When inferring on a 4k HEVC video, around 80% of the execution time is spent on VideoDataset decode. Future work can focus on using NVDEC or other GPU-accelerated video loaders.

I have made a version that works with NVIDIA's VPF library, which takes advantage of the NVENC and NVDEC hardware video accelerators and creates GPU tensors directly without involving the CPU. It runs inside a Docker container under WSL.

However, I don't plan to publish the code, since I don't think I can redistribute the nvenc/nvdec/x264 binaries, and my glue code only works with the version I compiled myself at the time I wrote it.

One thing I can verify is that the claimed inference speed is achievable on consumer-grade GPUs, and a GeForce RTX series GPU can be faster than a Quadro RTX simply because of its NVENC/NVDEC performance.
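
For reference, the decode side of a VPF-based pipeline generally looks like the sketch below. This is based on VPF's public samples rather than the unpublished glue code, and the exact class and enum names vary between VPF releases:

```python
import PyNvCodec as nvc                # NVIDIA VideoProcessingFramework (VPF)

gpu_id = 0
nv_dec = nvc.PyNvDecoder("input.mp4", gpu_id)     # NVDEC hardware decoder
width, height = nv_dec.Width(), nv_dec.Height()

# Convert decoded NV12 surfaces to RGB entirely on the GPU.
to_rgb = nvc.PySurfaceConverter(width, height,
                                nvc.PixelFormat.NV12, nvc.PixelFormat.RGB, gpu_id)
cc = nvc.ColorspaceConversionContext(nvc.ColorSpace.BT_709, nvc.ColorRange.MPEG)

while True:
    surface = nv_dec.DecodeSingleSurface()
    if surface.Empty():                # end of stream
        break
    rgb = to_rgb.Execute(surface, cc)
    # rgb stays in GPU memory; VPF's PyTorch extension can expose it as a CUDA
    # tensor for the matting model, and nvc.PyNvEncoder can encode the result
    # with NVENC on the same device.
```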
