Thread safety #499
-
Is llama.cpp thread safe? I have encountered some problems and weird issues when creating a ctx on one thread and then using it on another.
Replies: 3 comments 5 replies
-
See #370 (comment) tl;dr not yet, but it's a priority and parallel inference is on the roadmap.
-
It would be nice if llama.cpp were thread safe: h2oai/h2ogpt#1017
-
This also affects package stability in the special case of CPU-only inference. I get segfaults during concurrent inference attempts (using Streamlit) even on CPU-only machines, where these errors are especially easy to reproduce given how long inference takes on CPU alone. It can be reproduced in the official ... I know it is a bit of a narrow and specialized case, but maybe ...