Implement prompt/generation alignment #161

Open
rlouf opened this issue Jun 22, 2023 · 5 comments · May be fixed by #531

Labels: enhancement, text (Linked to text generation)
Comments

rlouf commented Jun 22, 2023

Guidance implements a method called token healing, which corrects for the quirks introduced by modern encodings like BPE. See this notebook for a thorough explanation of why this is necessary. The implementation for Transformers models is here.

This consists of backtracking one or several tokens and starting generation while imposing that we reproduce the text corresponding to the removed tokens. This can be integrated in the __call__ method of the Sequence class.
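
To make the idea concrete, here is a minimal sketch assuming a Transformers GPT-2 tokenizer; healing_candidates is a hypothetical helper name, not the Outlines API:

# Rough sketch: back up over the last prompt token and only allow first-step
# tokens whose decoded string starts with the removed text.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def healing_candidates(prompt: str):
    ids = tokenizer.encode(prompt)
    removed = tokenizer.decode(ids[-1:])   # text covered by the last prompt token
    trimmed_ids = ids[:-1]                 # prompt with that token removed
    allowed = [
        token_id
        for token, token_id in tokenizer.get_vocab().items()
        if tokenizer.convert_tokens_to_string([token]).startswith(removed)
    ]
    return trimmed_ids, allowed

For a prompt ending in " prompt", the allowed set would contain " prompt" itself plus any longer vocabulary entries that start with it.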

rlouf added the text (Linked to text generation) and enhancement labels Jun 22, 2023
rlouf added this to the 0.1 milestone Jul 13, 2023
rlouf commented Jul 15, 2023

First thoughts on how this could be achieved. Let's consider the prompt This is a prompt .

  1. We loop over the entire vocabulary and use partial matching to determine the tokens that cross the prompt boundary, i.e. the ones that start with a suffix of the prompt but are longer than that suffix. This gives us a list of potential tokens.
  2. For each of these tokens, match the part that lies within the boundary and strip it from the prompt string.
  3. Tokenize the prompt from which we have removed the part within the boundary. Generate a mask that only allows the previously found token(s).

What do we do when (2) gives several matches?
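
For illustration, a rough sketch of steps (1) and (2), using plain suffix matching instead of FSM-based partial matching (this leaves the tie-breaking question above open):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
prompt = "This is a prompt"

crossing = {}
for token, token_id in tokenizer.get_vocab().items():
    string = tokenizer.convert_tokens_to_string([token])
    # A token crosses the boundary if it starts with a non-empty suffix of the
    # prompt and extends past the end of the prompt.
    for start in range(len(prompt)):
        suffix = prompt[start:]
        if string.startswith(suffix) and len(string) > len(suffix):
            # Step (2): the part within the boundary is prompt[start:], so the
            # prompt to re-tokenize in step (3) is prompt[:start].
            crossing.setdefault(start, []).append(token_id)
            break

crossing[start] then holds the candidate token ids whose overlap with the prompt begins at character start, and prompt[:start] is what would be re-tokenized in step (3).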

arunpatro commented

In case of multiple matches, we can rank by the integer id of the token(s): a smaller integer implies a more frequently occurring token in the BPE tokenization process.

How about evaluating the log probs of the sequence? I think the smaller integer should also have the higher log prob.
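
A rough sketch of the log-prob idea, assuming a Transformers GPT-2 model; rank_candidates is a hypothetical helper, not an Outlines function:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def rank_candidates(trimmed_prompt: str, candidate_ids: list) -> list:
    # Score each candidate token by its log prob given the truncated prompt,
    # highest first; ranking by token id would be the cheaper alternative.
    inputs = tokenizer(trimmed_prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    logprobs = torch.log_softmax(logits, dim=-1)
    return sorted(candidate_ids, key=lambda i: logprobs[i].item(), reverse=True)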

rlouf commented Jul 17, 2023

Here is a quick outline of a solution that uses outlines.text.parsing.find_partial_matches and only loops through the vocabulary once:

from outlines.text.parsing import find_partial_matches
from transformers import AutoTokenizer
import interegular
from collections import defaultdict

tokenizer = AutoTokenizer.from_pretrained("gpt2")
vocabulary = tokenizer.get_vocab()

# The BPE tokenizer encodes spaces as Ġ, so anything based on string matching
# requires converting tokens back to plain strings first.
sorted_vocabulary = [
    tokenizer.convert_tokens_to_string([k]) for k, v in sorted(vocabulary.items(), key=lambda kv: kv[1])
]

prompt = "This is a new day"
fsm = interegular.parse_pattern(prompt).to_fsm()

tokenized_prompt = tokenizer.encode(prompt)
prompt_tokens = [tokenizer.decode(t) for t in tokenized_prompt]
token_idx_in_prompt = [prompt.rfind(pt) for pt in prompt_tokens]  # fails if a token appears several times; better to track the position in the prompt string

found = defaultdict(list)
for vocab_str in sorted_vocabulary:
    pmatch = find_partial_matches(
        fsm,
        vocab_str
    )
    if pmatch != set():
        end_idx, states = pmatch.pop()  # We need to loop over the matches instead
        if end_idx is not None and states[-1] == len(prompt):
            if states[0] in token_idx_in_prompt:
                found[token_idx_in_prompt.index(states[0])].append(vocab_str)

print(found)
# {4: [' day', ' days', ' daylight', ' daytime']}

We then need to back up one token, generate the next token using a mask built from the list above, and then generate the rest of the sequence as usual.
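
As an illustration, one possible way to build such a mask from the found dict, reusing tokenizer and vocabulary from the snippet above (the torch-based masking is only a sketch):

import torch

candidate_strings = found[4]  # [' day', ' days', ' daylight', ' daytime'] in the example
candidate_ids = [
    token_id
    for token, token_id in vocabulary.items()
    if tokenizer.convert_tokens_to_string([token]) in candidate_strings
]

mask = torch.full((len(vocabulary),), float("-inf"))
mask[candidate_ids] = 0.0
# Adding `mask` to the logits of the first generation step only allows the
# boundary-crossing candidates; subsequent steps proceed as usual.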

Now I understand my intuition behind leaving a space at the end of my prompt: this tells the model that it shouldn't complete the word. As you can see, the found dict contains not only " day" but also completions like " daytime".

However, if you run the code above for the prompt "This is a new day ", you can see that it backs up one token (the whitespace), and suggests the ~33,000 tokens that start with a whitespace as potential continuations.

rlouf commented Jul 17, 2023

Another fun one:

import outlines.models as models
import outlines.text.generate as generate

model = models.transformers("gpt2")

prompt = "Is regex-guided generation useful? "
unguided = generate.continuation(model, max_tokens=30)(prompt)
guided = generate.regex(model, r"(Yes|No)", max_tokens=30)(prompt)

print(guided)
# Is regex-guided generation useful? No

prompt = "Is regex-guided generation useful?"
guided = generate.regex(model, r"(Yes|No)", max_tokens=30)(prompt)
print(guided)
# Is regex-guided generation useful?No

prompt = "Is regex-guided generation useful?"
guided = generate.regex(model, r"( )?(Yes|No)", max_tokens=30)(prompt)
print(guided)
# Is regex-guided generation useful? No

print([k for k in model.tokenizer.vocabulary.keys() if k.endswith("Yes")])
# ['Yes', 'ĠYes']

The "right" prompting here would be to leave a whitespace after the question, since we don't want "useful?" to be completed. However, " Yes" might be the most likely answer, as this is typically how the model would have tokenized "Is regex-generation useful? Yes". So we need to back one token and add this character to the regex. In this case, we should be able to match " Yes", " No" and also a succession of whitespace and "Yes" or "No".

rlouf changed the title from "Implement token healing" to "Implement token/prompt alignment" Jul 17, 2023
RobinPicard linked a pull request Jan 11, 2024 that will close this issue
RobinPicard commented

Do you think it would be the right strategy to still create the regex_fsm during the initialization of RegexFSM, but to only create the states_to_token_maps after the generation function is called with the prompt (we would first modify the regex_fsm to include the states corresponding to the last token of the prompt)? The downside is that we would add some overhead to every call of the generation function.
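
Loose pseudocode of that proposal, with build_states_to_token_maps standing in for whatever currently constructs the map; none of these names are the actual Outlines classes:

import re

def build_states_to_token_maps(pattern, tokenizer):
    ...  # placeholder for the existing construction logic

class DeferredRegexFSM:
    def __init__(self, pattern, tokenizer):
        # The regex FSM itself can still be built eagerly here.
        self.pattern = pattern
        self.tokenizer = tokenizer
        self.states_to_token_maps = None

    def bind_prompt(self, prompt):
        # Extend the pattern with the text of the last prompt token, then
        # build the token map; this is the per-call overhead mentioned above.
        ids = self.tokenizer.encode(prompt)
        removed = self.tokenizer.decode(ids[-1:])
        aligned = re.escape(removed) + self.pattern
        self.states_to_token_maps = build_states_to_token_maps(aligned, self.tokenizer)
        return self.tokenizer.decode(ids[:-1])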

rlouf linked a pull request Jan 27, 2024 that will close this issue
rlouf changed the title from "Implement token/prompt alignment" to "Implement prompt/generation alignment" Feb 11, 2024