
[Issue #820] Cannot convert token � (29333) to bytes: � for some model vocabularies when using llama.cpp #890

Conversation


@desaxce commented May 12, 2024

An attempt at solving issue #820: RuntimeError: Cannot convert token � (29333) to bytes: �

With additional characters in the gpt2_bytes_to_unicode() map, I no longer get errors. However, adding ad-hoc characters like that doesn't feel right; we should test against many more kinds of tokenizers.
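
For context, the table in question mirrors the reference GPT-2 byte-to-unicode map, which only covers the 256 byte values; decoding inverts that table, so a vocabulary character outside it has no byte to map back to. Here is a minimal sketch of the mechanism (the function below follows the reference GPT-2 implementation; the failing-token check at the end is illustrative):

```python
def gpt2_bytes_to_unicode() -> dict[int, str]:
    """Map every byte value (0-255) to a printable unicode character.

    Printable ASCII and Latin-1 ranges map to themselves; the remaining
    byte values are shifted into code points 256 and above so that no
    token string ever contains an unprintable character.
    """
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


# Decoding a token inverts the table; any character absent from it has
# no byte to map back to, which surfaces as the RuntimeError above.
unicode_to_bytes = {v: k for k, v in gpt2_bytes_to_unicode().items()}
token = "\ufffd"  # illustrative: the replacement character is not in the table
assert any(c not in unicode_to_bytes for c in token)
```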

Benchmarks pass:
[screenshot: benchmark results, all passing]

Side note: I had to switch to Linux for development because of the vllm dependency; see their Requirements.

@desaxce marked this pull request as draft on May 12, 2024 15:33
@rlouf (Member) commented May 13, 2024

Thank you for opening a PR! Could we add a small test (maybe with the llama.cpp integration tests) that fails on main but passes with this change?
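
A minimal sketch of what such a regression test could look like, checking that every character of the failing token from #820 is covered by the extended map; the import path and the exact token string are assumptions, not the project's actual layout:

```python
def test_failing_token_characters_are_in_byte_map():
    # Assumption: the helper is importable from the llama.cpp model
    # module; adjust the path to wherever the project defines it.
    from outlines.models.llamacpp import gpt2_bytes_to_unicode

    unicode_to_bytes = {v: k for k, v in gpt2_bytes_to_unicode().items()}
    # Token 29333 from issue #820 decodes to the replacement character;
    # on main it is absent from the table, so conversion raises.
    token = "\ufffd"
    assert all(c in unicode_to_bytes for c in token)
```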

@rlouf (Member) commented May 18, 2024

This issue was addressed in #892, closing. Thank you for opening a PR!

@rlouf closed this on May 18, 2024
@desaxce deleted the desaxce/cannot-convert-replacement-character branch on May 18, 2024 08:01