[feature] Support inference on raw text input in main and server. #6982
Comments
The existing logic for adding the BOS token with SPM tokenizers is this (llama.cpp, lines 12665 to 12669 at commit 9c67c27):
So if you call
This sounds like a model problem - don't know what we can do in
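For context, a minimal caller-side sketch of how BOS is controlled when using llama.cpp as a library, assuming the C API of roughly this period (the exact `llama_tokenize` signature and the `model.gguf` path are illustrative; check `llama.h` in your checkout):

```cpp
// Sketch only: controlling BOS via llama_tokenize()'s add_special flag,
// assuming the llama.cpp C API from around this period.
// "model.gguf" is a placeholder path.
#include "llama.h"
#include <cstdio>
#include <string>
#include <vector>

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (model == nullptr) {
        return 1;
    }

    const std::string text = "Hello world";
    std::vector<llama_token> tokens(text.size() + 16);

    // add_special = false -> the text is tokenized as-is, no BOS prepended;
    // add_special = true  -> BOS is prepended for SPM vocabs (subject to the
    //                        tokenizer.ggml.add_bos_token metadata).
    const int n_tokens = llama_tokenize(model, text.c_str(), (int) text.size(),
                                        tokens.data(), (int) tokens.size(),
                                        /*add_special =*/ false,
                                        /*parse_special =*/ true);
    printf("tokenized into %d tokens\n", n_tokens);

    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```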
I see, but I'm not using it as a library. I'm just using it via E.g. with I'm making yet another interface that will run in the browser (mostly for fun) and I wanted to allow total control over the template used (since I have my own Jinja template parser). Basically, to maximize someone's ability to experiment with these models (and maybe to help with finding errors related to the templates used).
I'm for this, especially since I've been training GGUF models from scratch lately; I'm working on a custom grammar dataset at the moment. I think there's a way to toggle these, though. It will take me time to investigate, as I'm really into the dataset creation thing at the moment. The token control will be crucial when I begin experimenting with fine-tuning for conversational formats. It will also be invaluable for model generalization and performance testing down the line.
EDIT: In my reply below I also mention my wish for the `/completion` endpoint.

My use case is that I want to do inference with raw input, meaning that I will parse the Jinja `tokenizer.chat_template` (which is stored in GGUF files) myself. With my Jinja parser I can then apply a system prompt (when supported) and a conversation history, and get the resulting text to do the inference on.
But the current problem with this method is that llama.cpp forcefully starts with the BOS token, which the template will also add! Hence the text is going to start with two BOS tokens.
I could potentially just remove the BOS token from my text then, but please see my ramblings below.
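A minimal sketch of that stripping workaround, purely illustrative (the textual form of BOS, here `<s>`, is just an example and depends on the model's vocabulary):

```cpp
// Illustrative only: strip a leading BOS marker from the text produced by the
// chat template, so that llama.cpp's forced BOS does not result in two of them.
#include <string>

std::string strip_leading_bos(std::string prompt, const std::string & bos = "<s>") {
    if (prompt.rfind(bos, 0) == 0) {   // prompt starts with the BOS marker
        prompt.erase(0, bos.size());
    }
    return prompt;
}
```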
Ramblings:
Sometimes a template doesn't even use a BOS token even if one is defined in the metadata.
Also, I sometimes see the field `tokenizer.ggml.add_bos_token: False`. What does it mean? I see llama.cpp still adding it in such a case.

Here is some data to look at:
Notice how, when `add_bos_token` is false, it is not used in the template?
Although I saw one model that did just that... (by mistake?)
So all this is quite confusing!
If we want the best possible result then I want control over these things myself, because llama.cpp can make mistakes when it comes to such templates, and a model will expect what it expects...
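Relating to the `add_bos_token` question above, a small sketch of how these metadata hints can be queried through the C API, assuming `llama_add_bos_token()` and `llama_token_bos()` are present in this version of `llama.h`:

```cpp
// Sketch: query the BOS-related hints exposed from the GGUF metadata.
// llama_add_bos_token() reflects tokenizer.ggml.add_bos_token
// (1 = add, 0 = don't add, -1 = not specified in the metadata).
#include "llama.h"
#include <cstdio>

void print_bos_info(const llama_model * model) {
    const int32_t add_bos = llama_add_bos_token(model);
    const llama_token bos = llama_token_bos(model);
    printf("add_bos_token hint: %d, BOS token id: %d\n", add_bos, bos);
}
```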