
token generation speed decreases when using grammar #810

Open

abeiro opened this issue Apr 25, 2024 · 1 comment

abeiro commented Apr 25, 2024

When a set of GBNF rules is specified via the 'grammar' parameter through the API, token generation slows down considerably.

For instance, I'm using this file: https://github.com/ggerganov/llama.cpp/blob/master/grammars/json.gbnf to ensure that the LLM always generates valid JSON content.
On a Tesla V100-SXM2-16GB, the tokens-per-second rate drops from 70 to 30 compared to plain text output. Through testing, I also noticed that an RTX 3080 generates tokens at the same speed as an RTX 3060 (both at 30 tokens per second). I find it curious that the speed is roughly the same across all three cards whenever grammar sampling is used.

When the grammar parameter is disabled, generation speeds do differ between the cards. This strikes me as odd. Is this normal? Is there another way to ensure that the LLM always returns valid JSON objects?
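For reference, this is roughly how the grammar is being passed in my requests (a minimal sketch; the endpoint path, port, and payload fields other than 'grammar' are assumptions and may differ from the actual API in use):

```python
import requests

# Minimal sketch of passing a GBNF grammar through the generate API.
# Endpoint path, port, and response shape are assumptions and may differ
# from the actual API version in use.
with open("json.gbnf", "r", encoding="utf-8") as f:
    json_grammar = f.read()

payload = {
    "prompt": "Describe the weather as a JSON object with keys 'sky' and 'temp_c'.",
    "max_length": 200,
    "grammar": json_grammar,  # GBNF rules constraining sampling to valid JSON
}

resp = requests.post("http://localhost:5001/api/v1/generate", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["results"][0]["text"])
```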

@LostRuins
Owner

Yes, that is expected. Grammar sampling is rather expensive to do.

If the model is well tuned, it should be able to produce valid JSON without grammar, so an easier way would be to attempt generation and then try to parse the result, retrying on failure.
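A minimal sketch of that generate-then-validate loop, where generate() is a stand-in for whatever API call you already make without the grammar parameter (it is not part of any specific library):

```python
import json

def generate_valid_json(generate, prompt, max_retries=3):
    """Generate without grammar constraints and retry until the output parses as JSON.

    `generate` is a stand-in for the existing API call (prompt -> text).
    """
    last_error = None
    for _ in range(max_retries):
        text = generate(prompt)
        try:
            return json.loads(text)  # accept the first output that is valid JSON
        except json.JSONDecodeError as err:
            last_error = err  # malformed output: retry without any grammar cost
    raise ValueError(f"no valid JSON after {max_retries} attempts") from last_error
```

Each attempt runs without grammar sampling, so generation stays at the unconstrained speed; as long as the model usually produces well-formed JSON, an occasional retry tends to be cheaper than constraining every token.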
