[Issue #820] Cannot convert token � (29333) to bytes: � for some model vocabularies when using llama.cpp #890

desaxce · 2024-05-12T09:56:38Z

Attempt at solving issue #820 - RuntimeError: Cannot convert token � (29333) to bytes: �

With additional characters added to the gpt2_bytes_to_unicode() map, I no longer get errors.
However, adding ad-hoc characters like this doesn't seem like a good solution. We should test against many more kinds of tokenizers.
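
For context, here is a minimal sketch of why the replacement character trips the conversion. It rebuilds the standard GPT-2 byte-to-unicode table from scratch and is not the library's actual code: the names bytes_to_unicode, BYTE_TO_CHAR, CHAR_TO_BYTE and token_to_bytes are hypothetical, and the error message only imitates the one in the issue title. The point is that the table covers exactly 256 characters, so a vocabulary entry containing a literal space or U+FFFD has no inverse.

```python
def bytes_to_unicode():
    """Standard GPT-2 style table: every byte value gets a printable character."""
    # Bytes that are already printable map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # The remaining bytes (control characters, space, ...) are shifted to U+0100 and up.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def token_to_bytes(token: str) -> bytes:
    """Hypothetical helper: invert the table one character at a time."""
    try:
        return bytes(CHAR_TO_BYTE[ch] for ch in token)
    except KeyError as exc:
        # Any character outside the 256-entry table cannot be mapped back.
        raise RuntimeError(f"Cannot convert token {token!r} to bytes") from exc


print(token_to_bytes("Ġhello"))   # the space byte is spelled "Ġ" in this scheme
print("\ufffd" in CHAR_TO_BYTE)   # the replacement character is not in the table
```

A tokenizer whose vocabulary was not produced with this byte-level scheme can contain such characters directly, which is why patching individual entries into the table only covers the cases seen so far.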

Benchmarks pass.

Side note: I had to switch to Linux for development because of the vllm dependency; see their Requirements.

rlouf · 2024-05-13T09:09:29Z

Thank you for opening a PR! Could we add a small test (maybe with the llama.cpp integration tests) that fails on main but passes with this change?

rlouf · 2024-05-18T07:49:28Z

This issue was addressed in #892, closing. Thank you for opening a PR!

desaxce and others added 2 commits May 12, 2024 11:04

fix: add space and replacement character to bytes to unicode map

9818e89

fix: add Ń character

41cac07

desaxce marked this pull request as draft May 12, 2024 15:33

lapp0 mentioned this pull request May 15, 2024

Circumvent Broken llama.cpp Pre-Tokenizer #892

Merged

Merge branch 'main' into desaxce/cannot-convert-replacement-character

15d3ad1

rlouf closed this May 18, 2024

desaxce deleted the desaxce/cannot-convert-replacement-character branch May 18, 2024 08:01
