Server: possibility of customizable chat template? #5922

Open
ngxson opened this issue Mar 7, 2024 · 16 comments
Labels
enhancement New feature or request

Comments

@ngxson
Collaborator

ngxson commented Mar 7, 2024

Motivation

While we already have support for known chat templates, it's sometimes not enough for users who want to:

  • Use their own fine-tuned model
  • Or, use a model that does not have a Jinja template

The problem is that other implementations of chat templates out there are also quite messy, for example:

  • Jinja template: as discussed in server : improvements and maintenance #4216, it's too complicated to add such a parser into the code base of llama.cpp
  • The ollama format requires a parser, and it's not very flexible for future uses
  • The LM Studio format does not require a parser, but lacks support for multiple roles (we currently have system - user - assistant, but technically it's possible to have custom roles like database, function, search-engine, ...)

Possible implementation

My idea is to have a simple JSON format that takes into account all roles:

{
  "system": {
    "prefix": "<|system|>\n",
    "postfix": "<|end|>\n"
  },
  "user": {
    "prefix": "<|user|>\n",
    "postfix": "<|end|>\n"
  },
  "assistant": {
    "prefix": "<|assistant|>\n",
    "postfix": "<|end|>\n"
  },
  "_stop": ["<|end|>"],
  "_generation": "<|assistant|>\n",
}

Users can specify the custom template via --chat-template-file ./my_template.json

The cpp code will be as simple as:

#include <sstream>
#include <string>
#include <nlohmann/json.hpp>
using json = nlohmann::json;

std::string apply_custom_template(const json & messages, const json & tmpl) {
  std::stringstream ss;
  for (const auto & msg : messages) {
    // look up the prefix/postfix pair for this message's role
    const json & t = tmpl.at(msg.at("role").get<std::string>());
    ss << t.at("prefix").get<std::string>()
       << msg.at("content").get<std::string>()
       << t.at("postfix").get<std::string>();
  }
  ss << tmpl.at("_generation").get<std::string>(); // add generation prompt
  return ss.str();
}
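
For illustration only (not from the original comment), here is a minimal way the function above could be exercised, with the template inlined as a raw string instead of loaded from my_template.json:

#include <cstdio> // in addition to the includes above

int main() {
  json tmpl = json::parse(R"({
    "user":        { "prefix": "<|user|>\n",      "postfix": "<|end|>\n" },
    "assistant":   { "prefix": "<|assistant|>\n", "postfix": "<|end|>\n" },
    "_generation": "<|assistant|>\n"
  })");
  json messages = json::array({
    { {"role", "user"}, {"content", "Hello!"} }
  });
  // prints: <|user|>\nHello!<|end|>\n<|assistant|>\n
  printf("%s", apply_custom_template(messages, tmpl).c_str());
  return 0;
}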

NOTE: This function does not take into account models that do not support a system prompt for now, but that can be added in the future, maybe toggled via an attribute inside the JSON: "system_inside_user_message": true
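
A hedged sketch (my illustration, not part of the proposal) of how such a toggle could be honored, reusing the includes and json alias from the snippet above; the attribute name system_inside_user_message is the hypothetical one from the note:

// Fold the system message into the first user message when the template sets
// "system_inside_user_message": true; otherwise behave like the function above.
std::string apply_custom_template_v2(const json & messages, const json & tmpl) {
  const bool sys_in_user = tmpl.value("system_inside_user_message", false);
  std::string pending_system;
  std::stringstream ss;
  for (const auto & msg : messages) {
    const std::string role    = msg.at("role").get<std::string>();
    std::string       content = msg.at("content").get<std::string>();
    if (sys_in_user && role == "system") {
      pending_system = content + "\n"; // defer until the first user turn
      continue;
    }
    if (role == "user" && !pending_system.empty()) {
      content = pending_system + content;
      pending_system.clear();
    }
    const json & t = tmpl.at(role);
    ss << t.at("prefix").get<std::string>() << content << t.at("postfix").get<std::string>();
  }
  ss << tmpl.at("_generation").get<std::string>(); // add generation prompt
  return ss.str();
}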

Ref:

@ngxson ngxson added the enhancement New feature or request label Mar 7, 2024
@ggerganov
Owner

Users that want to support a certain template should open a PR and implement it in the framework that we already have

@ngxson
Collaborator Author

ngxson commented Mar 7, 2024

Users that want to support a certain template should open a PR and implement it in the framework that we already have

Yeah, I thought that would be ideal, but sometimes that's not even enough: maybe a user wants to fine-tune and try out one single template?

Another problem is that there are currently models that don't have a Jinja template, for example the old Alpaca ### Instruction format (which I don't feel should exist in our code base; another problem is that we cannot support a stop word for this template). Yet some users are still using it.

The currently proposed solution is to use /completions with a custom proxy; I've already mentioned it in the wiki page.

My proposal in this issue is not complete, so I'll leave it here to see if anyone comes up with another use case (or another idea) that we've never thought about.

@teleprint-me
Contributor

Yeah, this is why I said templates should be the responsibility of the user.

It's why I always use the completions endpoint and avoid any chat template enforcement. The problem is simple to understand once you understand how a tokenizer is created and trained. ANY token can be used. ANY TOKEN. This ambiguity makes it problematic to require even a loose set of rules.

This problem will exist even if the industry agrees upon a standard. ChatML is nice, but it's only a single use case and doesn't solve the wider issue at hand. This also makes completions more valuable because they're the most flexible. Completions also set the foundation or stage for chat fine-tuning. It sucks that OpenAI took it down, but it's what I always use when I use llama.cpp.

Completions are literally the only reason I use llama.cpp. There's so much more flexibility that way. Just put the responsibility on the user, end of discussion. This isn't a conclusion I came to lightly. It took time, research, and experimentation to figure this out. This is why I backed and proposed this idea.

This is the most viable solution for the interim, and even then, this solution fails miserably with FIM and other templates. It's not that I think this is an impossible problem, but it is the type of problem that will create an entirely different set of problems that compound one another and eventually become a bottleneck in the best-case scenario.

I really do understand the temptation here, but it's best avoided.

@ngxson
Collaborator Author

ngxson commented Mar 10, 2024

I really do understand the temptation here, but it's best avoided.

Thanks for your input. For clarification, I'm not saying that my proposal solves all the issues we have with chat templates in general. If I were that confident, I could have just made a PR instead.

I'm also not assuming that supporting custom chat templates is a good idea or not. I'm still learning here. I understand your point. It's a valid reason not to have this feature, and I appreciate that.

However, I will still keep this issue open for a while to collect some feedback; maybe it will be helpful if we change the decision in the future.

@teleprint-me
Contributor

teleprint-me commented Mar 12, 2024

I think this is the middle-of-the-road solution, which is good.

I just keep reiterating it because the tokens are dictated by the tokenizer and the settings used to train the tokenizer. Then the chat template is fine-tuned with any added tokens.

All of the tokens for the model are (usually, but unfortunately not always) in there, e.g. tokenizer.model or tokenizer.json. It was really interesting and fun learning how this worked.

So, to clarify, I support your proposal. Once the tokenizer is understood, managing the templates for the model becomes more intuitive.

A really great way to get a feel for it is with completions.

@ngxson
Collaborator Author

ngxson commented Mar 14, 2024

I came across ollama/ollama#1977 and feel like we're in the middle of a "war of templates". You're right @teleprint-me, there's temptation, but it's better to avoid it, at least at this stage.

Edit: Still, I'm feeling quite lucky because in llama.cpp we have test-chat-template.cpp to make sure that the template works 100% the same as the original Jinja version.

@teleprint-me
Contributor

I'd love to have it automated; it would be great. I forget where I stated it, but I remember reiterating that this is similar to "Hilbert's paradox of the Grand Hotel", which "is a thought experiment which illustrates a counterintuitive property of infinite sets".

This issue arises because of the desire to support many models with a variety of templates. Model developers can choose to set up the template however they'd like and so can fine-tuners.

The moment you begin baking in special tokens, chat templates, and more, is the moment you've bound yourself to an uncanny solution that exponentially becomes more difficult to manage over time. You'll always need to accommodate another "guest".

The simplest solution is to create an API or framework that developers can plug in to. @ggerganov actually suggested this same solution a while ago. I've recommended this solution multiple times. I've been advocating for placing the chat template as the responsibility of the user. My rationale is to keep the code and API simple and digestible.

I'm confident that there is a way to find a middle ground, but we'll need to work towards it. I think your idea is actually sound, and the reason is that it's simple and flexible. The motto around here seems to be to not over-engineer, but supporting chat templates will require much more than that, and this doesn't account for the maintenance that will ensue as a result. It has technical debt written all over it.

I think using the prefix and postfix for prompts is probably the best we can do until templates become solidified. It's still early and we're just getting started. It's better to observe and learn as we progress. Once a pattern emerges, we can use that as an anchor.

@teleprint-me
Contributor

teleprint-me commented Mar 14, 2024

As an aside, I'd love to build a custom tokenizer for llama.cpp. I think it would be great. We could use it for training and fine-tuning. I haven't looked at the backend lately, but back-propagation would obviously help for updating the weights. What would be really neat is training and fine-tuning quants. If I remember correctly, the model outputs logits, the softmax turns them into probabilities, and the backward pass updates the weights using a cross-entropy loss (feel free to correct me, I'm still learning). Now that would be really cool :)

@kaizau
Contributor

kaizau commented Apr 20, 2024

Re: #6726 (comment)

@ngxson The main problem is that even with this level of flexibility, some templates can't be supported without some code logic (for example, the llama 2 template with [INST] and the <<SYS>> system message).

My hunch is that code logic in templates can still be avoided if the configuration provides enough flexibility. For example, providing an alternate template based on message index:

{
  "system": {
    "prefix": "<s>[INST] <<SYS>>\n",
    "postfix": "\n<<SYS>>\n\n"
  },
  "user_1": {
    "prefix": "",
    "postfix": " [/INST] "
  },
  "user": {
    "prefix": "<s>[INST] ",
    "postfix": " [/INST] "
  },
  "assistant": {
    "prefix": "",
    "postfix": " </s>"
  }
}

Or more generally:

{
  ...
  "user": {
    "prefix": "<s>[INST] ",
    "postfix": " [/INST] "
    "override": {
      "1": {
        "prefix": ""
      }
    }
  },
  ...
}
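
For concreteness, a sketch (illustrative only; get_affix is a made-up helper, reusing the nlohmann::json setup from the C++ snippet earlier in the thread) of how such an index-keyed override could be resolved when rendering the n-th message of a role:

// Return the prefix/postfix for a role, applying an "override" entry keyed by
// the message's index within that role when one exists.
static std::string get_affix(const json & role_tmpl, const std::string & key, int nth) {
  std::string value = role_tmpl.value(key, "");
  if (role_tmpl.contains("override")) {
    const json & ov = role_tmpl.at("override");
    const std::string idx = std::to_string(nth);
    if (ov.contains(idx) && ov.at(idx).contains(key)) {
      value = ov.at(idx).at(key).get<std::string>();
    }
  }
  return value;
}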

Though I wonder if a one-off workaround for the first user message might even be enough.

@kaizau
Contributor

kaizau commented Apr 20, 2024

Re: #6726 (comment)

@bullno1 Sounds cool and I'd say take it further, why even template or search & replace within a role? Just change it to "prefix" and "suffix":

Might just be me, but I slightly prefer the aesthetic / concise legibility of seeing the entire message in context:

{
  "system": "<s>[INST] <<SYS>>\n{{content}}<<SYS>>\n\n",
  "user_1": "{{content}} [/INST] ",
  "user": "<s>[INST] {{content}} [/INST] ",
  "assistant": "{{content}} </s>"
}

Not a big deal, because both are more legible than a string of escaped Jinja.

As for injection risk, this wouldn't need to execute code — just do a string replacement. Maybe I'm overlooking something here?
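
For what it's worth, a minimal sketch of that string replacement (illustrative only; render_role is a made-up name), showing that no code execution is involved:

#include <string>

// Substitute the "{{content}}" placeholder in a role's template string.
// Plain search-and-replace; nothing is evaluated.
static std::string render_role(std::string role_tmpl, const std::string & content) {
  const std::string placeholder = "{{content}}";
  const size_t pos = role_tmpl.find(placeholder);
  if (pos != std::string::npos) {
    role_tmpl.replace(pos, placeholder.size(), content);
  }
  return role_tmpl;
}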

@kaizau
Contributor

kaizau commented Apr 20, 2024

@teleprint-me Seems like chat templates brush up against a macro question around the ideal scope of llama.cpp and this server "example" in general.

But whether the chat endpoint lives here or elsewhere, configurable chat templates are a very natural extension. Anything else is an incomplete solution that forces clients (or users) to implement their own workarounds on top of /completions if they want to support the vast and growing number of templates.

@ngxson
Collaborator Author

ngxson commented Apr 20, 2024

Might just be me, but I slightly prefer the aesthetic / concise legibility of seeing the entire message in context

@kaizau I personally prefer having prefix/postfix explicitly, since it makes the cpp code more readable. I think the format you proposed is more suitable for higher-level programming languages, since the parser can be just one or two lines of code.

@teleprint-me
Contributor

@kaizau I agree with your assessment.

@bullno1
Contributor

bullno1 commented Apr 20, 2024

Re: #6726 (comment)

@bullno1 Sounds cool and I'd say take it further, why even template or search & replace within a role? Just change it to "prefix" and "suffix":

Might just be me, but I slightly prefer the aesthetic / concise legibility of seeing the entire message in context:

{
  "system": "<s>[INST] <<SYS>>\n{{content}}<<SYS>>\n\n",
  "user_1": "{{content}} [/INST] ",
  "user": "<s>[INST] {{content}} [/INST] ",
  "assistant": "{{content}} </s>"
}

Not a big deal, because both are more legible than a string of escaped Jinja.

As for injection risk, this wouldn't need to execute code — just do a string replacement. Maybe I'm overlooking something here?

The injection risk I was talking about is more about user input containing special tokens like <|start_of_turn|>, <|eot_id|>, ...
When we tokenize the prefix/suffix markers separately, the user message can be tokenized with parse_special=false, and those special tags will appear as literal strings instead of special tokens.

When everything is templated into a single string, parse_special has to be true, and it becomes easy to put words into the other role's mouth.
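
A sketch of the distinction (my illustration; it assumes the llama_tokenize convenience wrapper from common.h with a (ctx, text, add_special, parse_special) signature, and tokenize_turn is a made-up helper): tokenize the template markers and the user content separately so that special-token text inside user input stays literal:

#include <string>
#include <vector>
#include "common.h" // assumed: llama_tokenize wrapper and llama.h types

// Tokenize one chat turn: markers with parse_special=true, user content with
// parse_special=false, so "<|eot_id|>"-style text typed by the user stays literal.
static std::vector<llama_token> tokenize_turn(llama_context * ctx,
                                              const std::string & prefix,
                                              const std::string & content,
                                              const std::string & suffix) {
  std::vector<llama_token> out;
  auto append = [&](const std::string & text, bool parse_special) {
    auto toks = llama_tokenize(ctx, text, /*add_special=*/false, parse_special);
    out.insert(out.end(), toks.begin(), toks.end());
  };
  append(prefix,  true);   // template markers: special tokens are parsed
  append(content, false);  // user text: special markers remain plain strings
  append(suffix,  true);
  return out;
}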

@hanishkvc
Contributor

Please do have a look at the code in the PR below. Around the time Llama 3 came out, I had a need to look at llama.cpp, and in turn I worked on the following: trying to see if one can have a generic flow, driven by a config file, that accommodates different modes/chat-handshake-template standards in a generic and flexible way. The idea is that if a new template standard is added during fine-tuning of a model, or if a new model or standard comes out that follows a sane convention matching the commonality I have noticed across many models/standards, then the generic code flow itself can be used by just updating the config file, without having to add a custom template block.

This in turn can be used by examples/main, examples/server, as well as other users of the llama.cpp library. Currently main has been patched to use this config-file-based flow, piggybacking to a great extent on its existing interactive mode and its in-prefix, in-suffix, and antiprompt options.

#6834

Based on some minimal testing on my end, I seem to be able to handle the nitty-gritty of around 8(+1) models using this generic code + config-file-based flow.

Currently the JSON format is used for the config file, but if needed it can be switched to a simpler text-based config file, to avoid users of the llama.cpp library needing to depend on a JSON library.

The generic code flow uses a concept similar to what this issue is also proposing, i.e. a generic code flow driven by a config file.

The generic flow additionally takes care of:

  • the conditionality seen across a few different models w.r.t. tagging of the system-message + 1st-user-message flow, by using related generic flags and also by how system-suffix, system-end, user-begin and user-prefix are set up.

  • the need to differentiate between role-begin, role-prefix, role-suffix and role-end tokens for each role, and in turn the variation in whether they are inserted or not across different models. This is handled in a simple and generic way, by allowing each of these to be set to a specific string or left as an empty string for each role, as needed by that specific model.

You can look at examples/chaton_meta.json, which has entries for the 8(+1) models/standards that I have tested with my patch.

@JoakimCh

teleprint-me: Yeah, this is why I said templates should be the responsibility of the user.

Agreed, which is what I asked about here: issues/6982.

As ngxson pointed out, "the code is so simple" that we can write it ourselves in whatever frontend we use.
