[feature] Support inference on raw text input in main and server. #6982

Open
JoakimCh opened this issue Apr 29, 2024 · 3 comments

JoakimCh commented Apr 29, 2024

EDIT: In my reply below I also mention my wish for the /completion endpoint.

My use case is that I want to run inference on raw text input, meaning I will parse the Jinja tokenizer.chat_template (which is stored in the GGUF file) myself.

With my Jinja parser I can then apply a system prompt (when supported) and a conversation history, and get the resulting text to run inference on.

But the problem with this approach is that llama.cpp forcibly prepends the BOS token, which the template will also add! So the text ends up starting with two BOS tokens.
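
Here is roughly what I mean, as a minimal Python sketch (assuming the jinja2 package; the template string is the Phi-3 one from the data further below):

from jinja2 import Template

# The Phi-3 chat template, copied from the metadata dump below:
chat_template = (
    "{{ bos_token }}{% for message in messages %}"
    "{{'<|' + message['role'] + '|>' + '\n' + message['content'] + '<|end|>\n' }}"
    "{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>\n' }}"
    "{% else %}{{ eos_token }}{% endif %}"
)

prompt = Template(chat_template).render(
    messages=[
        {"role": "system", "content": "custom system prompt"},
        {"role": "user", "content": "hi from user"},
    ],
    bos_token="<s>",
    eos_token="<|endoftext|>",
    add_generation_prompt=True,
)

print(prompt.startswith("<s>"))  # True: the BOS is already in the rendered text,
                                 # so the BOS that llama.cpp prepends is a duplicate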

I could potentially just remove the BOS token from my own text, but please see my ramblings below.

Ramblings:

Sometimes a template doesn't use the BOS token at all, even if one is defined in the metadata.

Also, I sometimes see the field tokenizer.ggml.add_bos_token: False. What does that mean? I see llama.cpp still adding the BOS token in such a case.

Here is some data to look at:

## MODEL Phi-3-mini-4k-instruct-q4.gguf ##

## TOKEN METADATA ##
{
  'tokenizer.ggml.add_bos_token': 1,
  'tokenizer.ggml.add_eos_token': 0,
  'tokenizer.ggml.bos_token': '<s>',
  'tokenizer.ggml.eos_token': '<|endoftext|>',
  'tokenizer.ggml.unknown_token': '<unk>',
  'tokenizer.ggml.padding_token': '<|endoftext|>'
}

## TEMPLATE DEFINITION ##
{{ bos_token }}{% for message in messages %}{{'<|' + message['role'] + '|>' + '
' + message['content'] + '<|end|>
' }}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>
' }}{% else %}{{ eos_token }}{% endif %}

## TEMPLATE DEMO ##
<s><|system|>
custom system prompt<|end|>
<|user|>
hi from user<|end|>
<|assistant|>
hello from assistant<|end|>
<|user|>
goodbye from user<|end|>
<|assistant|>


## MODEL codeqwen-1_5-7b-chat-q5_k_m.gguf ##

## TOKEN METADATA ##
{
  'tokenizer.ggml.add_bos_token': 0,
  'tokenizer.ggml.add_eos_token': 0,
  'tokenizer.ggml.bos_token': '<|endoftext|>',
  'tokenizer.ggml.eos_token': '<|im_end|>',
  'tokenizer.ggml.unknown_token': '<unk>',
  'tokenizer.ggml.padding_token': '<fim_pad>'
}

## TEMPLATE DEFINITION ##
{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
You are a helpful assistant<|im_end|>
' }}{% endif %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}

## TEMPLATE DEMO ##
<|im_start|>system
custom system prompt<|im_end|>
<|im_start|>user
hi from user<|im_end|>
<|im_start|>assistant
hello from assistant<|im_end|>
<|im_start|>user
goodbye from user<|im_end|>
<|im_start|>assistant

Notice how, when add_bos_token is false, the BOS token is not used in the template?
Although I did see one model that used it anyway... (by mistake?)
So all this is quite confusing!

If we want the best possible results, then I want control over these things myself, because llama.cpp can make mistakes when it comes to such templates, and a model will expect what it expects...

@ggerganov (Owner)

The existing logic for adding the BOS token with SPM tokenizers is this:

llama.cpp/llama.cpp, lines 12665 to 12669 at commit 9c67c27:

if (add_special && vocab.special_add_bos != 0) {
    GGML_ASSERT(vocab.special_bos_id != -1);
    output.push_back(vocab.special_bos_id);
}

So if you call llama_tokenize() with add_special == true and the model defines tokenizer.ggml.add_bos_token == true then a BOS token will be inserted automatically.
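
For a caller that renders the template itself, opting out of the automatic BOS could look roughly like this (a sketch using the third-party llama-cpp-python bindings, where the add_bos argument is, as far as I can tell, passed through as add_special; the model path is just an example):

from llama_cpp import Llama  # third-party Python bindings, for illustration only

llm = Llama(model_path="Phi-3-mini-4k-instruct-q4.gguf", verbose=False)

# Text already rendered by an external template parser, BOS included:
rendered = "<s><|system|>\ncustom system prompt<|end|>\n<|user|>\nhi from user<|end|>\n<|assistant|>\n"

# add_bos=False: no automatic BOS is prepended, so the one in the text is the only one.
# special=True: "<s>", "<|end|>", ... are parsed as special tokens rather than plain text.
tokens = llm.tokenize(rendered.encode("utf-8"), add_bos=False, special=True)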

Sometimes a template doesn't even use a BOS token even if one is defined in the metadata.

This sounds like a model problem - I don't know what we can do in llama.cpp in this regard.

@JoakimCh (Author)

I see, but I'm not using it as a library. I'm just using it via main and server and wanted better control.

E.g. with the server's /completion endpoint, I would like an option to send a "raw prompt" that is never preceded by a BOS token or a system prompt. By the way, is there a default system prompt if I don't set one?
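
Something along these lines, as a rough sketch (assuming the server's default port and the Python requests package; the raw_prompt flag is purely hypothetical and only illustrates what I'm asking for):

import requests

# Prompt already fully templated on the client side (ChatML style, no BOS wanted):
prompt = "<|im_start|>user\nhi from user<|im_end|>\n<|im_start|>assistant\n"

response = requests.post("http://127.0.0.1:8080/completion", json={
    "prompt": prompt,
    "n_predict": 64,
    # "raw_prompt": True,  # hypothetical flag: use the text exactly as sent,
    #                      # with no BOS token or system prompt prepended
})
print(response.json()["content"])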

I'm making yet another interface that will run in the browser (mostly for fun), and I wanted to allow total control over the template used (since I have my own Jinja template parser). Basically, to maximize someone's ability to experiment with these models (and maybe to help with finding errors related to the templates used).

JoakimCh changed the title from "[feature] Support a flag for not adding the BOS token?" to "[feature] Support inference on raw text input in main and server." on Apr 30, 2024
@teleprint-me (Contributor)

I'm for this, especially since I've been training GGUF models from scratch lately. I'm working on a custom grammar dataset at the moment.

I think there's a way to toggle these already, though.

It will take me time to investigate, as I'm deep into the dataset creation work at the moment.

Token control will be crucial when I begin experimenting with fine-tuning for conversational formats.

It will also be invaluable for model generalization and performance testing down the line.
