[feature] Support inference on raw text input in main and server. #6982

Open
JoakimCh opened this issue Apr 29, 2024 · 3 comments

JoakimCh commented Apr 29, 2024

EDIT: In my reply below I also mention my wish for the /completion endpoint.

My use case is that I want to run inference on raw text input, meaning I will parse the Jinja tokenizer.chat_template (which is stored in the GGUF file) myself.

With my Jinja parser I can then apply a system prompt (when supported) and a conversation history, and get the resulting text to run inference on.

But the problem with this approach is that llama.cpp forcibly prepends the BOS token, which the template will also add! So the text ends up starting with two BOS tokens.
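
Here is roughly what I mean, as a minimal Python sketch (assuming the jinja2 package; the template string is the Phi-3 one from the data further below):

from jinja2 import Template

# The Phi-3 chat template, copied from the metadata dump below:
chat_template = (
    "{{ bos_token }}{% for message in messages %}"
    "{{'<|' + message['role'] + '|>' + '\n' + message['content'] + '<|end|>\n' }}"
    "{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>\n' }}"
    "{% else %}{{ eos_token }}{% endif %}"
)

prompt = Template(chat_template).render(
    messages=[
        {"role": "system", "content": "custom system prompt"},
        {"role": "user", "content": "hi from user"},
    ],
    bos_token="<s>",
    eos_token="<|endoftext|>",
    add_generation_prompt=True,
)

print(prompt.startswith("<s>"))  # True: the BOS is already in the rendered text,
                                 # so the BOS that llama.cpp prepends is a duplicate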

I could potentially just remove the BOS token from my own text, but please see my ramblings below.

Ramblings:

Sometimes a template doesn't use the BOS token at all, even if one is defined in the metadata.

Also, I sometimes see the field tokenizer.ggml.add_bos_token: False. What does that mean? I see llama.cpp still adding the BOS token in such a case.

Here is some data to look at:

## MODEL Phi-3-mini-4k-instruct-q4.gguf ##

## TOKEN METADATA ##
{
  'tokenizer.ggml.add_bos_token': 1,
  'tokenizer.ggml.add_eos_token': 0,
  'tokenizer.ggml.bos_token': '<s>',
  'tokenizer.ggml.eos_token': '<|endoftext|>',
  'tokenizer.ggml.unknown_token': '<unk>',
  'tokenizer.ggml.padding_token': '<|endoftext|>'
}

## TEMPLATE DEFINITION ##
{{ bos_token }}{% for message in messages %}{{'<|' + message['role'] + '|>' + '
' + message['content'] + '<|end|>
' }}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>
' }}{% else %}{{ eos_token }}{% endif %}

## TEMPLATE DEMO ##
<s><|system|>
custom system prompt<|end|>
<|user|>
hi from user<|end|>
<|assistant|>
hello from assistant<|end|>
<|user|>
goodbye from user<|end|>
<|assistant|>


## MODEL codeqwen-1_5-7b-chat-q5_k_m.gguf ##

## TOKEN METADATA ##
{
  'tokenizer.ggml.add_bos_token': 0,
  'tokenizer.ggml.add_eos_token': 0,
  'tokenizer.ggml.bos_token': '<|endoftext|>',
  'tokenizer.ggml.eos_token': '<|im_end|>',
  'tokenizer.ggml.unknown_token': '<unk>',
  'tokenizer.ggml.padding_token': '<fim_pad>'
}

## TEMPLATE DEFINITION ##
{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
You are a helpful assistant<|im_end|>
' }}{% endif %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}

## TEMPLATE DEMO ##
<|im_start|>system
custom system prompt<|im_end|>
<|im_start|>user
hi from user<|im_end|>
<|im_start|>assistant
hello from assistant<|im_end|>
<|im_start|>user
goodbye from user<|im_end|>
<|im_start|>assistant

Notice how, when add_bos_token is false, the BOS token is not used in the template?
Although I did see one model that used it anyway... (by mistake?)
So all this is quite confusing!

If we want the best possible results, then I want control over these things myself, because llama.cpp can make mistakes when it comes to such templates, and a model will expect what it expects...

@ggerganov (Owner)

The existing logic for adding the BOS token with SPM tokenizers is this:

llama.cpp/llama.cpp, lines 12665 to 12669 at commit 9c67c27:

if (add_special && vocab.special_add_bos != 0) {
    GGML_ASSERT(vocab.special_bos_id != -1);
    output.push_back(vocab.special_bos_id);
}

So if you call llama_tokenize() with add_special == true and the model defines tokenizer.ggml.add_bos_token == true then a BOS token will be inserted automatically.
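
For a caller that renders the template itself, opting out of the automatic BOS could look roughly like this (a sketch using the third-party llama-cpp-python bindings, where the add_bos argument is, as far as I can tell, passed through as add_special; the model path is just an example):

from llama_cpp import Llama  # third-party Python bindings, for illustration only

llm = Llama(model_path="Phi-3-mini-4k-instruct-q4.gguf", verbose=False)

# Text already rendered by an external template parser, BOS included:
rendered = "<s><|system|>\ncustom system prompt<|end|>\n<|user|>\nhi from user<|end|>\n<|assistant|>\n"

# add_bos=False: no automatic BOS is prepended, so the one in the text is the only one.
# special=True: "<s>", "<|end|>", ... are parsed as special tokens rather than plain text.
tokens = llm.tokenize(rendered.encode("utf-8"), add_bos=False, special=True)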

Sometimes a template doesn't even use a BOS token even if one is defined in the metadata.

This sounds like a model problem - I don't know what we can do in llama.cpp in this regard.

@JoakimCh (Author)

I see, but I'm not using it as a library. I'm just using it via main and server and wanted better control.

E.g. with the server's /completion endpoint, I would like an option to send a "raw prompt" that is never preceded by a BOS token or a system prompt. By the way, is there a default system prompt if I don't set one?
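
Something along these lines, as a rough sketch (assuming the server's default port and the Python requests package; the raw_prompt flag is purely hypothetical and only illustrates what I'm asking for):

import requests

# Prompt already fully templated on the client side (ChatML style, no BOS wanted):
prompt = "<|im_start|>user\nhi from user<|im_end|>\n<|im_start|>assistant\n"

response = requests.post("http://127.0.0.1:8080/completion", json={
    "prompt": prompt,
    "n_predict": 64,
    # "raw_prompt": True,  # hypothetical flag: use the text exactly as sent,
    #                      # with no BOS token or system prompt prepended
})
print(response.json()["content"])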

I'm making yet another interface that will run in the browser (mostly for fun), and I wanted to allow total control over the template used (since I have my own Jinja template parser). Basically, to maximize someone's ability to experiment with these models (and maybe to help with finding errors related to the templates used).

JoakimCh changed the title from "[feature] Support a flag for not adding the BOS token?" to "[feature] Support inference on raw text input in main and server." on Apr 30, 2024
@teleprint-me (Contributor)

I'm for this, especially since I've been training GGUF models from scratch lately. I'm working on a custom grammar dataset at the moment.

I think there's a way to toggle these already, though.

It will take me time to investigate, as I'm deep into the dataset creation work at the moment.

Token control will be crucial when I begin experimenting with fine-tuning for conversational formats.

It will also be invaluable for model generalization and performance testing down the line.
