When a set of GBNF rules is specified via the `grammar` parameter through the API, generation performance drops noticeably.
For instance, I'm using this file: https://github.com/ggerganov/llama.cpp/blob/master/grammars/json.gbnf to ensure that the LLM always generates valid JSON content.
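For reference, this is roughly how I'm passing the grammar (a minimal sketch; it assumes the llama.cpp example server running on its default local port, and the prompt is just a placeholder):

```python
import requests

# Load the GBNF grammar that constrains sampling to valid JSON.
with open("grammars/json.gbnf") as f:
    grammar = f.read()

# Sketch of a request to the llama.cpp server's /completion endpoint;
# URL and prompt are placeholders for illustration.
response = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "Return the user profile as a JSON object:",
        "grammar": grammar,   # GBNF rules applied during sampling
        "n_predict": 256,
    },
)
print(response.json()["content"])
```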
Generation speed drops from about 70 to 30 tokens per second on a Tesla V100-SXM2-16GB (compared to plain-text output). Through testing, I also noticed that with grammar sampling enabled, an RTX 3080 generates tokens at the same speed as an RTX 3060 (both at roughly 30 tokens per second). I find it curious that the speed is roughly the same across all three cards.
When the grammar parameter is disabled, the generation speeds of these cards do differ. This strikes me as odd. Is this normal? Is there another way to ensure that the LLM always returns valid JSON objects?
Yes, that is expected. Grammar sampling is rather expensive.
If the model is well tuned, it should be able to produce valid JSON without the grammar, so an easier approach would be to attempt generation, try to parse the result, and retry on failure.
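A sketch of that retry loop (the `generate` callable is a placeholder for your existing completion call, made without the `grammar` parameter):

```python
import json

def generate_json(generate, max_retries=3):
    """Attempt generation and parse the result, retrying on failure.

    `generate` is a hypothetical callable that returns the raw completion
    text, e.g. a request to the server with no grammar attached.
    """
    for _ in range(max_retries):
        text = generate()
        try:
            return json.loads(text)  # succeeded: return the parsed object
        except json.JSONDecodeError:
            continue  # invalid JSON, generate again
    raise ValueError("model failed to produce valid JSON")
```

This trades occasional wasted generations for full-speed sampling, which can come out ahead when the model succeeds on the first try most of the time.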