Feature Questions #7244

Open
cha-noong opened this issue May 20, 2024 · 1 comment

Comments

@cha-noong

Since Jetson supports Triton Inference Server, I am considering adopting it.
So, I have a few questions.

  1. In an environment where multiple AI models run on Jetson, is there any advantage to using Triton Inference Server compared to running them individually with TensorRT? (Triton Inference Server's queuing optimization vs. the gRPC communication latency added over localhost)

  2. It appears that system shared memory and CUDA shared memory are both offered as ways to reduce localhost communication latency. What is the difference between the two? (The linked document describes the same functionality for both: https://docs.nvidia.com/deeplearning/triton-inference-server/archives/triton_inference_server_1140/user-guide/docs/client_example.html)

  3. System shared memory has been confirmed to work, but CUDA shared memory produces an error like the one in the issue referenced below.
    Does Jetson currently support CUDA shared memory? (Failed to register CUDA shared mem: failed to open CUDA IPC handle: invalid resource handle #5798)

@GuanLuo
Contributor

GuanLuo commented May 22, 2024

  1. When serving multiple models, Triton lets you serve them concurrently, and you can configure each model separately depending on your use case (see the config sketch after this list). Triton also supports other popular machine learning frameworks if your models are not all TensorRT. Another benefit is that serving a model with TensorRT directly requires writing additional code against its APIs, which Triton already handles for you, so deploying a model through Triton should take less effort.
  2. System shared memory is for sharing CPU memory between processes, while CUDA shared memory is for GPU memory. You usually want the data stored close to the device the model runs on, so explore CUDA shared memory if your model is deployed on the GPU (see the client sketch after this list).
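
To illustrate point 1, here is a hedged sketch of a per-model `config.pbtxt` (Triton's model configuration format). The model name, tensor names, shapes, and tuning values are illustrative placeholders, not recommendations from this thread:

```
# Hypothetical TensorRT model configuration; all values are placeholders.
name: "example_trt_model"
platform: "tensorrt_plan"
max_batch_size: 8
input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
# Run two instances of the model so queued requests can execute concurrently.
instance_group [
  { count: 2, kind: KIND_GPU }
]
# Let Triton batch queued requests (the "queuing optimization" from question 1).
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```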

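To illustrate point 2, below is a minimal sketch of the system-shared-memory flow with the Triton Python HTTP client, loosely following the patterns in the client shared-memory examples. The server address, the model name "example_model", and the tensor names and shapes are assumptions, not values taken from this issue:

```python
import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.shared_memory as shm

# Input data we want the server to read directly from shared memory.
input_data = np.arange(16, dtype=np.float32)
byte_size = input_data.nbytes

client = httpclient.InferenceServerClient(url="localhost:8000")

# Create a system (CPU) shared-memory region and copy the input into it.
shm_handle = shm.create_shared_memory_region("input_region", "/input_shm", byte_size)
shm.set_shared_memory_region(shm_handle, [input_data])

# Register the region with the server so it can read the input in place
# instead of receiving it in the HTTP/gRPC request body.
client.register_system_shared_memory("input_region", "/input_shm", byte_size)

# Point the request's input tensor at the registered region.
infer_input = httpclient.InferInput("INPUT0", list(input_data.shape), "FP32")
infer_input.set_shared_memory("input_region", byte_size)

result = client.infer(model_name="example_model", inputs=[infer_input])
output = result.as_numpy("OUTPUT0")  # hypothetical output tensor name

# Clean up: unregister on the server, then destroy the local region.
client.unregister_system_shared_memory("input_region")
shm.destroy_shared_memory_region(shm_handle)

# For GPU-resident data the flow is analogous, using
# tritonclient.utils.cuda_shared_memory.create_shared_memory_region(
#     "input_region", byte_size, device_id)
# plus client.register_cuda_shared_memory(...) with the region's raw CUDA IPC
# handle; that registration step is where the CUDA IPC error from question 3
# can appear.
```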