
ggml : add optional CPU backend context, support reusing threads, async compute #721

Open · slaren opened this issue Feb 1, 2024 · 2 comments
Labels: enhancement (New feature or request)

slaren (Collaborator) commented Feb 1, 2024

As recently seen in llama.cpp (ggerganov/llama.cpp#5226), the cost of starting the threads of the CPU backend is not insignificant. To address this, I propose adding a new CPU context object that holds the threads and can reuse them between invocations. Additionally, this CPU context would behave as an asynchronous queue, so that multiple graph evaluations could be queued into the object. This would enable the implementation of pipeline parallelism with the CPU and GPU backends (ref: ggerganov/llama.cpp#4918 (comment)).

Possible API:

```c
ggml_compute_context_t ggml_compute_context_init(int n_threads);
void ggml_graph_compute_async(ggml_compute_context_t context, struct ggml_cgraph * graph);
void ggml_synchronize(ggml_compute_context_t context);
```
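
For illustration, a sketch of how these calls might compose (hypothetical code: the thread count and graph variables are placeholders, and only the three declarations above are actually proposed):

```c
// Hypothetical usage of the proposed API; none of this exists in ggml yet.
ggml_compute_context_t ctx = ggml_compute_context_init(/* n_threads */ 8);

// Queue two graph evaluations without blocking. The context's persistent
// threads pick them up in order, so other backends (e.g. a GPU) can run
// concurrently with the CPU work.
ggml_graph_compute_async(ctx, graph_a);
ggml_graph_compute_async(ctx, graph_b);

// Block until everything queued on this context has finished.
ggml_synchronize(ctx);
```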
ggerganov added the enhancement label Feb 1, 2024

ggerganov (Owner) commented
Would the threads wait on a condition variable while not running? I've done some testing in the past maintaining a global pool of threads and waking them when there is work (ggerganov/whisper.cpp#343). That didn't seem to help performance much, but it's possible that the implementation was not ideal.

Regardless of whether there is a performance gain, the rest of the functionality that this would enable is worth it on its own.

slaren (Collaborator, Author) commented Feb 1, 2024

Yes, the threads would wait on a condition variable or something to the same effect. On Linux, and possibly macOS, the overhead of creating a thread and that of waking a blocked thread are probably close enough that for large graphs it wouldn't make much difference, but for the very small graphs often used by ggml_backend_sched it may be significant.
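
A minimal sketch of such a wait loop, assuming POSIX threads (the queue structure, field names, and single-slot queue are illustrative, not actual ggml code):

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct ggml_cgraph;  // opaque here; the real definition lives in ggml

// Single-slot work queue shared between the producer and the worker threads.
struct compute_queue {
    pthread_mutex_t      mutex;
    pthread_cond_t       cond;
    struct ggml_cgraph * pending;
    bool                 stop;
};

// Worker: sleeps on the condition variable between graphs instead of being
// re-created for every graph evaluation.
static void * worker(void * arg) {
    struct compute_queue * q = arg;
    for (;;) {
        pthread_mutex_lock(&q->mutex);
        // loop to handle spurious wakeups
        while (q->pending == NULL && !q->stop) {
            pthread_cond_wait(&q->cond, &q->mutex);
        }
        if (q->stop) {
            pthread_mutex_unlock(&q->mutex);
            return NULL;
        }
        struct ggml_cgraph * graph = q->pending;
        q->pending = NULL;
        pthread_mutex_unlock(&q->mutex);

        // evaluate `graph` here, e.g. via the existing compute path
        (void) graph;
    }
}

// Producer: publish a graph and wake one sleeping worker.
static void enqueue(struct compute_queue * q, struct ggml_cgraph * graph) {
    pthread_mutex_lock(&q->mutex);
    q->pending = graph;
    pthread_cond_signal(&q->cond);
    pthread_mutex_unlock(&q->mutex);
}
```

The per-graph cost here is one signal to wake a sleeping thread rather than a full thread spawn and join, which is the difference in overhead being discussed above.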
