
token generation speed decreases when using grammar #810

Open

abeiro opened this issue Apr 25, 2024 · 1 comment

abeiro commented Apr 25, 2024

When a set of GBNF rules is specified via the 'grammar' parameter through the API, token generation slows down considerably.

For instance, I'm using this file: https://github.com/ggerganov/llama.cpp/blob/master/grammars/json.gbnf to ensure that the LLM always generates valid JSON content.
On a Tesla V100-SXM2-16GB, the tokens-per-second rate drops from 70 to 30 compared to plain text output. Through testing, I also noticed that an RTX 3080 generates tokens at the same speed as an RTX 3060 (both at 30 tokens per second). I find it curious that the speed is roughly the same across all three cards whenever grammar sampling is used.

When the grammar parameter is disabled, generation speeds do differ between the cards. This strikes me as odd. Is this normal? Is there another way to ensure that the LLM always returns valid JSON objects?
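For reference, this is roughly how the grammar is being passed in my requests (a minimal sketch; the endpoint path, port, and payload fields other than 'grammar' are assumptions and may differ from the actual API in use):

```python
import requests

# Minimal sketch of passing a GBNF grammar through the generate API.
# Endpoint path, port, and response shape are assumptions and may differ
# from the actual API version in use.
with open("json.gbnf", "r", encoding="utf-8") as f:
    json_grammar = f.read()

payload = {
    "prompt": "Describe the weather as a JSON object with keys 'sky' and 'temp_c'.",
    "max_length": 200,
    "grammar": json_grammar,  # GBNF rules constraining sampling to valid JSON
}

resp = requests.post("http://localhost:5001/api/v1/generate", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["results"][0]["text"])
```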

@LostRuins
Owner

Yes, that is expected. Grammar sampling is rather expensive to do.

If the model is well tuned, it should be able to produce valid JSON without grammar, so an easier way would be to attempt generation and then try to parse the result, retrying on failure.
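A minimal sketch of that generate-then-validate loop, where generate() is a stand-in for whatever API call you already make without the grammar parameter (it is not part of any specific library):

```python
import json

def generate_valid_json(generate, prompt, max_retries=3):
    """Generate without grammar constraints and retry until the output parses as JSON.

    `generate` is a stand-in for the existing API call (prompt -> text).
    """
    last_error = None
    for _ in range(max_retries):
        text = generate(prompt)
        try:
            return json.loads(text)  # accept the first output that is valid JSON
        except json.JSONDecodeError as err:
            last_error = err  # malformed output: retry without any grammar cost
    raise ValueError(f"no valid JSON after {max_retries} attempts") from last_error
```

Each attempt runs without grammar sampling, so generation stays at the unconstrained speed; as long as the model usually produces well-formed JSON, an occasional retry tends to be cheaper than constraining every token.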
