
Continuous batching #1333

Open
andreapiso opened this issue Jul 6, 2023 · 6 comments
Labels: enhancement (New feature or request)

@andreapiso
Recently, a lot of benchmarks point to the fact that if you want to serve your models behind an API, continuous batching grants higher throughput and lower latency compared to static batching. Examples of systems that implement continuous batching include vLLM (discussed below).

In order to enable continuous batching, it is necessary to be able to (see the sketch after this list):

  1. add requests to an existing running batch when there are enough resources to take them (compared to static batching, where requests need to be submitted all together);
  2. remove a request from the batch early, as soon as it reaches the stop token (as opposed to returning all requests at the same time).
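
For concreteness, the two requirements correspond to the admission and retirement steps of a scheduler loop like the minimal, framework-agnostic sketch below. All names here (`Request`, `decode_step`, `serve_forever`) are hypothetical illustrations, not part of CTranslate2:

```python
import queue
from dataclasses import dataclass, field

EOS = "</s>"
MAX_BATCH_SIZE = 8

@dataclass
class Request:
    prompt: list                               # prompt tokens
    generated: list = field(default_factory=list)

def decode_step(batch):
    # Stand-in for one forward pass of the model: return the next
    # token for every sequence currently in the batch. A real engine
    # would run the decoder here.
    return [EOS for _ in batch]

def serve_forever(incoming: queue.Queue):
    active = []
    while True:
        # (1) Admission: pull new requests into the *running* batch
        #     whenever a slot is free, instead of waiting for the
        #     whole batch to drain (static batching).
        if not active:
            active.append(incoming.get())      # block until work arrives
        while len(active) < MAX_BATCH_SIZE:
            try:
                active.append(incoming.get_nowait())
            except queue.Empty:
                break

        # One decoding step for every sequence in the batch.
        for req, token in zip(active, decode_step(active)):
            req.generated.append(token)

        # (2) Early exit: retire sequences that produced the stop
        #     token, freeing their slots for the next iteration.
        active = [r for r in active if r.generated[-1] != EOS]
```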

Is this concept compatible with the CTranslate2 architecture? I am keen to build an inference engine on top of CTranslate2 and would love to hear some thoughts on this before I dive into it.

@michaelfeil
Contributor

#1317

@andreapiso
Author

@michaelfeil Is this related? Yes, vLLM supports continuous batching, but I'm looking to understand whether CTranslate2 itself can be extended to support it, without using vLLM.

@guillaumekln
Collaborator

guillaumekln commented Jul 7, 2023

  1. Currently it is not possible to add an entry to a batch that is already running. However, you could buffer incoming requests and batch them together before calling CTranslate2 (a sketch follows this list). I think this is already good enough in many situations.
  2. This is already possible: there is a callback parameter to get tokens as soon as they are generated, and finished requests are removed from the batch.
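
A minimal sketch of what this could look like with CTranslate2's Python API, assuming the `Generator.generate_batch` callback interface of recent releases (the callback receives a `GenerationStepResult` per generated token when `beam_size` is 1, and returning `True` cancels that example; field names such as `batch_id`, `step`, and `token` should be checked against the installed version):

```python
import queue
import time
import ctranslate2

MAX_BATCH = 16      # largest batch to submit at once
MAX_WAIT_S = 0.05   # accumulation window before submitting a partial batch

def batching_worker(generator: ctranslate2.Generator, requests: queue.Queue):
    # `requests` holds lists of prompt tokens, one list per request.
    while True:
        # Point 1: buffer incoming requests for a short window, then
        # submit them together as one batch.
        batch = [requests.get()]               # block for the first request
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break

        # Point 2: stream tokens per example through the callback;
        # finished examples are removed from the batch internally, so
        # short requests do not wait for long ones to complete.
        def on_token(result) -> bool:
            print(f"example {result.batch_id}, step {result.step}: {result.token}")
            return False   # returning True would cancel this example early

        generator.generate_batch(batch, max_length=128, callback=on_token)
```

The accumulation window trades a few milliseconds of latency for larger batches; combined with the early removal from point 2, this recovers much of the throughput benefit of continuous batching without modifying CTranslate2 itself.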

@andreapiso
Author

Yes, buffering incoming requests and sending them together is what I meant by static batching.

Is 1. not possible today because of a difference in architecture between CT2 and HF Transformers, or is it possible in theory but the mechanism has simply not been implemented?

@guillaumekln
Collaborator

CT2 was not designed with this feature in mind, so it is not trivial to implement. But it is of course possible in theory.
