How to write custom Wordpiece class? #1525

Open
xinyinan9527 opened this issue May 9, 2024 · 0 comments

Comments


xinyinan9527 commented May 9, 2024

My aim is to produce the rwkv5 model's "tokenizer.json", but that model is implemented as a slow tokenizer (a `PreTrainedTokenizer` subclass).
To convert the slow tokenizer to a fast one I need something like `tokenizer = Tokenizer(WordPiece(...))`, but rwkv5 ships its own WordPiece implementation.
So I want to create a custom WordPiece model.

Here is the code:

```python
from tokenizers.models import Model

class MyWordpiece(Model):
    def __init__(self, vocab, unk_token):
        self.vocab = vocab
        self.unk_token = unk_token

test = MyWordpiece('./vocab.txt', "<s>")
```

This raises:

```
Traceback (most recent call last):
  File "test.py", line 78, in <module>
    test = MyWordpiece('./vocab.txt',"<s>")
TypeError: Model.__new__() takes 0 positional arguments but 2 were given